# Architecture and Reduction

> ← Back to [README.md](../README.md)

This document describes the internal architecture of **Radixor** and the principles behind its **trie compilation and reduction model**.

It explains:

- how data flows from dictionary input to compiled trie
- how patch-command tries are structured
- how subtree reduction works
- how reduction modes affect behavior and size

## Overview

Radixor transforms dictionary data into an optimized runtime structure through three stages:

1. **Mutable construction**
2. **Reduction (canonicalization)**
3. **Compilation (freezing)**

```
Dictionary → Mutable trie → Reduced trie → Compiled trie
```

Each stage has a distinct purpose:

| Stage       | Purpose                     | Structure      |
|-------------|-----------------------------|----------------|
| Build       | Collect mappings            | `MutableNode`  |
| Reduction   | Merge equivalent subtrees   | `ReducedNode`  |
| Compilation | Optimize for runtime lookup | `CompiledNode` |

## Core data model

### Patch-command trie

Radixor stores **patch commands** instead of stems directly.

- keys: word forms
- values: transformation commands
- structure: trie (prefix tree)

At runtime:

1. the word is traversed through the trie
2. a patch command is retrieved
3. the patch is applied to reconstruct the stem

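The three runtime steps can be illustrated with a minimal sketch. It is not Radixor's implementation: the patch format `<cut>:<suffix>` (drop `cut` trailing characters, then append `suffix`), the class `PatchSketch`, and the use of a `Map` in place of the trie are all assumptions made for illustration, and returning the word unchanged when no patch is found is likewise an assumed policy.

```java
import java.util.Map;

/** Illustrative only: the "<cut>:<suffix>" patch format is an assumption, not Radixor's encoding. */
final class PatchSketch {

    /** Drops {@code cut} trailing characters from the word, then appends the suffix. */
    static String apply(String word, String patch) {
        int sep = patch.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("malformed patch: " + patch);
        }
        int cut = Integer.parseInt(patch.substring(0, sep));
        if (cut < 0 || cut > word.length()) {
            throw new IllegalArgumentException("cut out of range: " + patch);
        }
        return word.substring(0, word.length() - cut) + patch.substring(sep + 1);
    }

    /** Steps 1-3, with a Map standing in for the trie traversal. */
    static String stem(Map<String, String> patches, String word) {
        String patch = patches.get(word);
        // Assumed policy: a word without a stored patch is returned unchanged.
        return patch == null ? word : apply(word, patch);
    }
}
```

Storing a short command such as `3:y` instead of the stem itself means that many word forms with the same suffix behavior carry the same value, which is what makes the later subtree merging effective.
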
## Stage 1: Mutable construction

The builder (`FrequencyTrie.Builder`) constructs a trie using:

- `MutableNode`
- maps of children (`char → node`)
- maps of value counts (`value → frequency`)

Characteristics:

- insertion-order preserving
- mutable
- optimized for building, not querying

Example structure:

```
g
└─ n
   └─ i
      └─ n
         └─ n
            └─ u
               └─ r
                  └─ (values: {
                       "<patch-command-1>": 3,
                       "<patch-command-2>": 1
                     })
```

This example represents the word "running", stored in reversed form.

- each edge corresponds to one character of the word
- the path is traversed from the end of the word toward the beginning
- the terminal node stores one or more patch commands together with their local frequencies

The values represent transformations from the word form to candidate stems, and the counts indicate how often each mapping was observed during construction.

Note: Radixor stores word forms in reversed order so that suffix-based transformations can be matched efficiently in a trie.

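A minimal sketch of this construction stage, assuming insertion-ordered maps for both children and value counts. The class `MutableTrieSketch` and its `put` / `valuesFor` methods are hypothetical names for illustration and do not reflect the actual `FrequencyTrie.Builder` API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative mutable trie keyed by reversed word forms; names are hypothetical. */
final class MutableTrieSketch {

    static final class Node {
        // LinkedHashMap keeps insertion order, matching the characteristic listed above.
        final Map<Character, Node> children = new LinkedHashMap<>();
        final Map<String, Integer> valueCounts = new LinkedHashMap<>();
    }

    final Node root = new Node();

    /** Walks the word from its last character to its first and counts the patch at the end node. */
    void put(String word, String patch) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.valueCounts.merge(patch, 1, Integer::sum);
    }

    /** Returns the value counts stored for the word, or an empty map if the path does not exist. */
    Map<String, Integer> valuesFor(String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0 && node != null; i--) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? Map.of() : node.valueCounts;
    }
}
```

Calling `put("running", p)` three times with one patch and once with another reproduces the `3` / `1` counts in the example structure, with the first edge from the root labeled `g`.
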
## Local value summary

Before reduction, each node is summarized using `LocalValueSummary`.

It computes:

- ordered values (by frequency)
- aligned counts
- total frequency
- dominant value (if any)
- second-best value

This summary is critical for:

- deterministic ordering
- reduction decisions
- dominance evaluation

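The following sketch shows one way such a summary could be derived from a node's value counts. It is an assumption-laden illustration rather than the real `LocalValueSummary`: the class name `LocalSummarySketch` and its fields are hypothetical, and ties are simply left in insertion order here (the full tie-breaking rules are listed under "Deterministic ordering").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative per-node summary; field and method names are hypothetical. */
final class LocalSummarySketch {
    final List<String> orderedValues;  // most frequent first
    final int[] orderedCounts;         // orderedCounts[i] belongs to orderedValues.get(i)
    final int total;                   // sum of all counts at the node

    private LocalSummarySketch(List<String> orderedValues, int[] orderedCounts, int total) {
        this.orderedValues = orderedValues;
        this.orderedCounts = orderedCounts;
        this.total = total;
    }

    static LocalSummarySketch of(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep the map's iteration order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> values = new ArrayList<>();
        int[] aligned = new int[entries.size()];
        int sum = 0;
        for (int i = 0; i < entries.size(); i++) {
            values.add(entries.get(i).getKey());
            aligned[i] = entries.get(i).getValue();
            sum += aligned[i];
        }
        return new LocalSummarySketch(List.copyOf(values), aligned, sum);
    }

    /** Candidate dominant value, or null for a node without values. */
    String best() {
        return orderedValues.isEmpty() ? null : orderedValues.get(0);
    }

    /** Runner-up used when evaluating dominance, or null if there is none. */
    String secondBest() {
        return orderedValues.size() < 2 ? null : orderedValues.get(1);
    }
}
```
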
## Stage 2: Reduction (canonicalization)

Reduction is the process of merging **semantically equivalent subtrees**.

### Why reduction exists

Without reduction:

- trie size grows linearly with input data
- repeated patterns are duplicated

With reduction:

- identical subtrees are shared
- memory footprint is reduced
- binary output becomes smaller

## Reduction signature

Each subtree is represented by a **ReductionSignature**.

A signature consists of:

1. **local descriptor** (node semantics)
2. **child descriptors** (structure)

```
Signature = (LocalDescriptor, SortedChildDescriptors)
```

Two subtrees are merged if their signatures are equal.

## Local descriptors

The local descriptor encodes how values at a node are interpreted.

Radixor supports three descriptor types:

### 1. Ranked descriptor

Preserves:

- full ordering of values (`getAll()`)

Uses:

- ordered value list

Best for:

- correctness
- deterministic multi-result behavior

### 2. Unordered descriptor

Preserves:

- only membership (set of values)

Ignores:

- ordering differences

Best for:

- higher compression
- use cases where ordering is irrelevant

### 3. Dominant descriptor

Preserves:

- only the dominant value (`get()`)

Condition:

- dominant value must satisfy thresholds:
  - minimum percentage
  - ratio over second-best

Fallback:

- if dominance is not strong enough → ranked descriptor is used

Best for:

- maximum compression
- single-result workflows

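A rough sketch of how the three descriptor types differ in what they compare. Everything here is illustrative: the `Mode` enum, the `describe` method, and the exact threshold semantics (`minPercent` as a share of the node's total count, `minRatio` as a multiple of the second-best count) are assumptions, not the library's definitions.

```java
import java.util.List;
import java.util.Set;

/** Illustrative descriptors: equality of the returned object is what a mode preserves. */
final class DescriptorSketch {

    enum Mode { RANKED, UNORDERED, DOMINANT }

    /** orderedValues and orderedCounts are aligned and sorted most frequent first. */
    static Object describe(Mode mode, List<String> orderedValues, int[] orderedCounts,
                           int minPercent, int minRatio) {
        switch (mode) {
            case UNORDERED:
                return Set.copyOf(orderedValues);       // membership only
            case DOMINANT:
                if (isDominant(orderedCounts, minPercent, minRatio)) {
                    return orderedValues.get(0);        // only the dominant value
                }
                return List.copyOf(orderedValues);      // fallback to the ranked form
            default:
                return List.copyOf(orderedValues);      // full ordering
        }
    }

    /** Assumed thresholds: share of the total count and ratio over the runner-up. */
    static boolean isDominant(int[] counts, int minPercent, int minRatio) {
        if (counts.length == 0) {
            return false;
        }
        long total = 0;
        for (int count : counts) {
            total += count;
        }
        long best = counts[0];
        long second = counts.length > 1 ? counts[1] : 0;
        return best * 100 >= total * minPercent && best >= second * minRatio;
    }
}
```

Two nodes with values `[a, b]` and `[b, a]` produce different ranked descriptors but equal unordered ones, which is why the unordered form allows more merging.
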
## Child descriptors

Each child is represented as:

```
(edge character, child signature)
```

Children are sorted by edge character to ensure:

- deterministic signatures
- stable equality comparisons

## Reduction context

`ReductionContext` maintains:

- mapping: `ReductionSignature → ReducedNode`
- canonical instances of subtrees

Workflow:

1. compute signature
2. check if already exists
3. reuse existing node or create new one

This ensures:

- structural sharing
- no duplicate equivalent subtrees

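The workflow can be sketched as a bottom-up pass that interns each subtree by its signature. This is a simplified illustration, not the actual `ReductionContext`: the `Node` class is hypothetical, only the ranked form of the local descriptor is used, and a signature is modeled as a plain `List` whose child part holds edge characters alternating with already-canonical child nodes (compared by identity).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative signature-based canonicalization; all types here are hypothetical. */
final class ReductionSketch {

    /** Minimal node: ranked local values plus children sorted by edge character. */
    static final class Node {
        final List<String> values;
        final TreeMap<Character, Node> children = new TreeMap<>();

        Node(String... values) {
            this.values = List.of(values);
        }
    }

    /** Signature -> canonical node, the role the text assigns to the reduction context. */
    private final Map<List<Object>, Node> canonical = new HashMap<>();

    Node reduce(Node node) {
        Node candidate = new Node(node.values.toArray(new String[0]));
        List<Object> childDescriptors = new ArrayList<>();
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            Node child = reduce(edge.getValue());   // children first, so they are canonical
            candidate.children.put(edge.getKey(), child);
            childDescriptors.add(edge.getKey());
            childDescriptors.add(child);            // Node uses identity equals/hashCode
        }
        // 1. compute signature  2. check if it exists  3. reuse or register the new node
        List<Object> signature = List.of(node.values, childDescriptors);
        Node existing = canonical.putIfAbsent(signature, candidate);
        return existing != null ? existing : candidate;
    }

    int canonicalCount() {
        return canonical.size();
    }
}
```

Because children are reduced before their parent, comparing canonical child instances by identity is equivalent to comparing their signatures, which keeps each signature shallow.
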
## Reduced nodes

`ReducedNode` represents:

- canonical subtree
- aggregated value counts
- canonical children

It supports:

- merging local counts
- verifying structural consistency

At this stage:

- structure is canonical
- still mutable (internally)

## Stage 3: Compilation (freezing)

The reduced trie is converted into a **CompiledNode** structure.

### CompiledNode characteristics

- immutable
- array-based storage
- optimized for fast lookup

Fields:

- `char[] edgeLabels`
- `CompiledNode[] children`
- `V[] orderedValues`
- `int[] orderedCounts`

## Lookup algorithm

Runtime lookup:

1. traverse the trie, matching characters from the end of the word toward the beginning
2. at each node, binary-search `edgeLabels` for the next character
3. retrieve values
4. apply patch command

Properties:

- O(length of word) node visits, each with a binary search over that node's edge labels
- low memory overhead
- minimal memory allocation during lookup; patch application produces the resulting string

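A minimal sketch of this lookup over the array-based node layout. The field names follow the `CompiledNode` fields listed in this document, but the class itself, the `String` value type, and the choice to return `null` on a missing edge are assumptions for illustration.

```java
import java.util.Arrays;

/** Illustrative array-based lookup; miss handling is an assumption. */
final class CompiledLookupSketch {

    static final class CompiledNode {
        final char[] edgeLabels;        // sorted ascending so binary search is valid
        final CompiledNode[] children;  // children[i] is reached via edgeLabels[i]
        final String[] orderedValues;   // best value first
        final int[] orderedCounts;      // aligned with orderedValues

        CompiledNode(char[] edgeLabels, CompiledNode[] children,
                     String[] orderedValues, int[] orderedCounts) {
            this.edgeLabels = edgeLabels;
            this.children = children;
            this.orderedValues = orderedValues;
            this.orderedCounts = orderedCounts;
        }
    }

    /** Returns the top-ranked value for the word, or null when no path or value exists. */
    static String get(CompiledNode root, String word) {
        CompiledNode node = root;
        for (int i = word.length() - 1; i >= 0; i--) {   // end of word toward the beginning
            int index = Arrays.binarySearch(node.edgeLabels, word.charAt(i));
            if (index < 0) {
                return null;                             // assumed: missing edge means no result
            }
            node = node.children[index];
        }
        return node.orderedValues.length == 0 ? null : node.orderedValues[0];
    }
}
```

The traversal itself allocates nothing; only the later patch application creates a new string.
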
## Deterministic ordering

Value ordering is deterministic and stable. Ties are broken in this order:

1. higher frequency first
2. shorter string first
3. lexicographically smaller first
4. earlier insertion first

This guarantees:

- reproducible builds
- stable query results
- predictable ranking

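The ordering rules translate directly into a comparator. The sketch below is an illustration under stated assumptions: values are `String` patch commands, "lexicographically" is taken to mean `String.compareTo`, and the class `ValueOrderingSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Illustrative ranking of (value, count) entries following the four rules above. */
final class ValueOrderingSketch {

    static final Comparator<Map.Entry<String, Integer>> ORDER = (a, b) -> {
        int byCount = Integer.compare(b.getValue(), a.getValue());             // 1. higher frequency
        if (byCount != 0) {
            return byCount;
        }
        int byLength = Integer.compare(a.getKey().length(), b.getKey().length()); // 2. shorter string
        if (byLength != 0) {
            return byLength;
        }
        return a.getKey().compareTo(b.getKey());                               // 3. lexicographic
    };

    /** Returns the values ranked best first. */
    static List<String> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // 4. List.sort is stable, so entries that still compare equal keep insertion order.
        entries.sort(ORDER);
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }
}
```

With distinct string values, the first three rules already produce a total order, so insertion order acts only as a final safeguard.
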
## Reduction modes

Reduction modes control how local descriptors are chosen.

### Ranked mode

```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

- preserves full semantics
- safest option
- recommended default

### Unordered mode

```
MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
```

- ignores ordering
- higher compression
- slightly weaker semantics

### Dominant mode

```
MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
```

- keeps only dominant result
- highest compression
- may lose alternative candidates

## Trade-offs

| Aspect      | Ranked | Unordered | Dominant  |
|-------------|--------|-----------|-----------|
| Compression | Medium | High      | Highest   |
| Accuracy    | High   | Medium    | Lower     |
| `getAll()`  | Full   | Partial   | Limited   |
| `get()`     | Exact  | Exact     | Heuristic |

## Deserialization model

Binary loading uses:

- `NodeData` as intermediate representation
- reconstruction of `CompiledNode`

This separates:

- I/O format
- in-memory structure

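The separation can be illustrated with a two-pass loader: the first pass reads flat `NodeData` records, the second builds linked nodes from them. The byte layout used here (node count, then per node its edges as `char` plus child index and its values as UTF string plus count, with node 0 as the root) is purely an assumption for the sketch and is not Radixor's actual file format.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative loader; the stream layout is hypothetical. */
final class LoadSketch {

    /** Plain holder mirroring one node record in the assumed stream layout. */
    static final class NodeData {
        char[] edgeLabels;
        int[] childIndexes;
        String[] values;
        int[] counts;
    }

    static final class CompiledNode {
        final char[] edgeLabels;
        final CompiledNode[] children;
        final String[] orderedValues;
        final int[] orderedCounts;

        CompiledNode(NodeData data) {
            this.edgeLabels = data.edgeLabels;
            this.children = new CompiledNode[data.edgeLabels.length];
            this.orderedValues = data.values;
            this.orderedCounts = data.counts;
        }
    }

    static CompiledNode load(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int nodeCount = data.readInt();
        if (nodeCount <= 0) {
            throw new IOException("invalid node count: " + nodeCount);
        }
        // Pass 1: read the I/O representation only.
        NodeData[] raw = new NodeData[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            raw[i] = readNode(data, nodeCount);
        }
        // Pass 2: create nodes, then link by index so shared subtrees stay shared.
        CompiledNode[] nodes = new CompiledNode[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            nodes[i] = new CompiledNode(raw[i]);
        }
        for (int i = 0; i < nodeCount; i++) {
            for (int e = 0; e < raw[i].childIndexes.length; e++) {
                nodes[i].children[e] = nodes[raw[i].childIndexes[e]];
            }
        }
        return nodes[0];
    }

    private static NodeData readNode(DataInputStream data, int nodeCount) throws IOException {
        NodeData node = new NodeData();
        int edges = checkCount(data.readInt());
        node.edgeLabels = new char[edges];
        node.childIndexes = new int[edges];
        for (int e = 0; e < edges; e++) {
            node.edgeLabels[e] = data.readChar();
            int child = data.readInt();
            if (child < 0 || child >= nodeCount) {
                throw new IOException("child index out of range: " + child);
            }
            node.childIndexes[e] = child;
        }
        int values = checkCount(data.readInt());
        node.values = new String[values];
        node.counts = new int[values];
        for (int v = 0; v < values; v++) {
            node.values[v] = data.readUTF();
            node.counts[v] = data.readInt();
        }
        return node;
    }

    private static int checkCount(int count) throws IOException {
        if (count < 0) {
            throw new IOException("negative count: " + count);
        }
        return count;
    }
}
```

Keeping index validation in the reading pass means malformed input is rejected before any in-memory node is linked.
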
## Why this architecture works

Radixor achieves:

### Compactness

- subtree sharing
- efficient encoding
- compressed binary output

### Performance

- array-based lookup
- no runtime reduction
- minimal branching

### Flexibility

- configurable reduction strategies
- multiple result support
- dictionary-driven behavior

### Determinism

- stable ordering
- canonical signatures
- reproducible builds

## Design philosophy

The architecture reflects a few key principles:

- separate build-time complexity from runtime simplicity
- encode semantics explicitly (not implicitly in code)
- favor deterministic behavior over heuristic shortcuts
- allow controlled trade-offs between size and fidelity

## When to tune reduction

Consider changing the reduction mode when:

- binary size is too large
- memory footprint must be minimized
- only single-result stemming is needed

Otherwise:

**use ranked mode by default**

## Next steps

- [Programmatic usage](programmatic-usage.md)
- [CLI compilation](cli-compilation.md)
- [Dictionary format](dictionary-format.md)

## Summary

Radixor’s architecture is built around:

- patch-command tries
- canonical subtree reduction
- immutable compiled structures

This design allows the system to remain:

- fast
- compact
- deterministic
- adaptable

while still supporting advanced use cases such as:

- ambiguity-aware stemming
- dictionary evolution
- controlled trade-offs between size and behavior