feat: implement Compile CLI for building binary stemmer tables from source dictionaries feat: add loading support for persisted compiled tries, including GZip-compressed binaries feat: add a builder path for recreating a writable trie from a compiled trie feat: expose read-only value/count access for compiled trie entries feat: support deterministic NOOP patch encoding for identical source and target words fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers fix: preserve valid alternative reductions during trie optimization and reduction fix: correct patch command edge cases discovered in round-trip and malformed-input tests fix: address persistence and compiled-trie handling defects found during implementation review fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs refactor: reorganize trie-related support types into dedicated packages and classes refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture refactor: improve compiled/read-only trie boundaries without restoring mutability refactor: clean up internal reduction, serialization, and helper structure test: add professional JUnit coverage for stemmer core classes test: split trie tests into dedicated test classes per production type test: improve parameterized tests for readability, diagnostics, and edge-case traceability test: cover positive, negative, malformed, persistence, and round-trip scenarios test: verify compiled dictionaries against source inputs using getAll semantics docs: write public README and supplementary Markdown documentation for project publishing docs: document architecture, reduction model, built-in languages, and operational guidance docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation docs: improve examples and wording for professional reader-facing project guidance chore: align project materials with the practical Radix scope and Egothor/Stempel lineage chore: raise overall project quality through documentation review and test hardening
8.4 KiB
Architecture and Reduction
← Back to README.md
This document describes the internal architecture of Radixor and the principles behind its trie compilation and reduction model.
It explains:
- how data flows from dictionary input to compiled trie
- how patch-command tries are structured
- how subtree reduction works
- how reduction modes affect behavior and size
Overview
Radixor transforms dictionary data into an optimized runtime structure through three stages:
- Mutable construction
- Reduction (canonicalization)
- Compilation (freezing)
Dictionary → Mutable trie → Reduced trie → Compiled trie
Each stage has a distinct purpose:
| Stage | Purpose | Structure |
|---|---|---|
| Build | Collect mappings | MutableNode |
| Reduction | Merge equivalent subtrees | ReducedNode |
| Compilation | Optimize for runtime lookup | CompiledNode |
Core data model
Patch-command trie
Radixor stores patch commands instead of stems directly.
- keys: word forms
- values: transformation commands
- structure: trie (prefix tree)
At runtime:
- the word is traversed through the trie
- a patch command is retrieved
- the patch is applied to reconstruct the stem
Stage 1: Mutable construction
The builder (FrequencyTrie.Builder) constructs a trie using:
MutableNode- maps of children (
char → node) - maps of value counts (
value → frequency)
Characteristics:
- insertion-order preserving
- mutable
- optimized for building, not querying
Example structure:
g
└─ n
└─ i
└─ n
└─ n
└─ u
└─ r
└─ (values: {
"<patch-command-1>": 3,
"<patch-command-2>": 1
})
This example represents the word "running", stored in reversed form.
- each edge corresponds to one character of the word
- the path is traversed from the end of the word toward the beginning
- the terminal node stores one or more patch commands together with their local frequencies
The values represent transformations from the word form to candidate stems, and the counts indicate how often each mapping was observed during construction.
Note: Radixor stores word forms in reversed order so that suffix-based transformations can be matched efficiently in a trie.
Local value summary
Before reduction, each node is summarized using LocalValueSummary.
It computes:
- ordered values (by frequency)
- aligned counts
- total frequency
- dominant value (if any)
- second-best value
This summary is critical for:
- deterministic ordering
- reduction decisions
- dominance evaluation
Stage 2: Reduction (canonicalization)
Reduction is the process of merging semantically equivalent subtrees.
Why reduction exists
Without reduction:
- trie size grows linearly with input data
- repeated patterns are duplicated
With reduction:
- identical subtrees are shared
- memory footprint is reduced
- binary output becomes smaller
Reduction signature
Each subtree is represented by a ReductionSignature.
A signature consists of:
- local descriptor (node semantics)
- child descriptors (structure)
Signature = (LocalDescriptor, SortedChildDescriptors)
Two subtrees are merged if their signatures are equal.
Local descriptors
The local descriptor encodes how values at a node are interpreted.
Radixor supports three descriptor types:
1. Ranked descriptor
Preserves:
- full ordering of values (
getAll())
Uses:
- ordered value list
Best for:
- correctness
- deterministic multi-result behavior
2. Unordered descriptor
Preserves:
- only membership (set of values)
Ignores:
- ordering differences
Best for:
- higher compression
- use cases where ordering is irrelevant
3. Dominant descriptor
Preserves:
- only the dominant value (
get())
Condition:
- dominant value must satisfy thresholds:
- minimum percentage
- ratio over second-best
Fallback:
- if dominance is not strong enough → ranked descriptor is used
Best for:
- maximum compression
- single-result workflows
Child descriptors
Each child is represented as:
(edge character, child signature)
Children are sorted by edge character to ensure:
- deterministic signatures
- stable equality comparisons
Reduction context
ReductionContext maintains:
- mapping:
ReductionSignature → ReducedNode - canonical instances of subtrees
Workflow:
- compute signature
- check if already exists
- reuse existing node or create new one
This ensures:
- structural sharing
- no duplicate equivalent subtrees
Reduced nodes
ReducedNode represents:
- canonical subtree
- aggregated value counts
- canonical children
It supports:
- merging local counts
- verifying structural consistency
At this stage:
- structure is canonical
- still mutable (internally)
Stage 3: Compilation (freezing)
The reduced trie is converted into a CompiledNode structure.
CompiledNode characteristics
- immutable
- array-based storage
- optimized for fast lookup
Fields:
char[] edgeLabelsCompiledNode[] childrenV[] orderedValuesint[] orderedCounts
Lookup algorithm
Runtime lookup:
- traverse trie using
edgeLabels(matching characters from the end of the word toward the beginning) - binary search per node
- retrieve values
- apply patch command
Properties:
- O(length of word)
- low memory overhead
- minimal memory allocation during lookup; patch application produces the resulting string
Deterministic ordering
Value ordering is deterministic and stable:
- higher frequency first
- shorter string first
- lexicographically smaller
- insertion order
This guarantees:
- reproducible builds
- stable query results
- predictable ranking
Reduction modes
Reduction modes control how local descriptors are chosen.
Ranked mode
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
- preserves full semantics
- safest option
- recommended default
Unordered mode
MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
- ignores ordering
- higher compression
- slightly weaker semantics
Dominant mode
MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
- keeps only dominant result
- highest compression
- may lose alternative candidates
Trade-offs
| Aspect | Ranked | Unordered | Dominant |
|---|---|---|---|
| Compression | Medium | High | Highest |
| Accuracy | High | Medium | Lower |
| getAll() | Full | Partial | Limited |
| get() | Exact | Exact | Heuristic |
Deserialization model
Binary loading uses:
NodeDataas intermediate representation- reconstruction of
CompiledNode
This separates:
- I/O format
- in-memory structure
Why this architecture works
Radixor achieves:
Compactness
- subtree sharing
- efficient encoding
- compressed binary output
Performance
- array-based lookup
- no runtime reduction
- minimal branching
Flexibility
- configurable reduction strategies
- multiple result support
- dictionary-driven behavior
Determinism
- stable ordering
- canonical signatures
- reproducible builds
Design philosophy
The architecture reflects a few key principles:
- separate build-time complexity from runtime simplicity
- encode semantics explicitly (not implicitly in code)
- favor deterministic behavior over heuristic shortcuts
- allow controlled trade-offs between size and fidelity
When to tune reduction
You should consider changing reduction mode when:
- binary size is too large
- memory footprint must be minimized
- only single-result stemming is needed
Otherwise:
use ranked mode by default
Next steps
Summary
Radixor’s architecture is built around:
- patch-command tries
- canonical subtree reduction
- immutable compiled structures
This design allows the system to remain:
- fast
- compact
- deterministic
- adaptable
while still supporting advanced use cases such as:
- ambiguity-aware stemming
- dictionary evolution
- controlled trade-offs between size and behavior