docs: improve README, MkDocs content, branding assets, and site polish
This commit is contained in:
@@ -1,468 +1,52 @@
|
||||
# Architecture and Reduction
|
||||
|
||||
This document describes the internal architecture of **Radixor** and the principles behind its **trie compilation and reduction model**.
|
||||
This section explains how **Radixor** turns textual dictionary input into a compact compiled stemmer and how reduction affects the semantics preserved in the final runtime artifact.
|
||||
|
||||
It explains:
|
||||
Radixor is easiest to understand when separated into two related concerns:
|
||||
|
||||
- how data flows from dictionary input to compiled trie
|
||||
- how patch-command tries are structured
|
||||
- how subtree reduction works
|
||||
- how reduction modes affect behavior and size
|
||||
- **architecture**: what structures exist, how data moves through them, and what runtime lookup actually does,
|
||||
- **reduction semantics**: what it means for two subtrees to be considered equivalent and how that choice affects `get()` and `getAll()` behavior.
|
||||
|
||||
## The short version
|
||||
|
||||
Radixor does not keep a large flat table of final stems. Instead, it converts dictionary entries into **patch commands**, stores them in a trie, reduces equivalent subtrees, and freezes the result into an immutable compiled structure.
|
||||
|
||||
## Overview
|
||||
The build-time flow is:
|
||||
|
||||
Radixor transforms dictionary data into an optimized runtime structure through three stages:
|
||||
|
||||
1. **Mutable construction**
|
||||
2. **Reduction (canonicalization)**
|
||||
3. **Compilation (freezing)**
|
||||
|
||||
```
|
||||
Dictionary → Mutable trie → Reduced trie → Compiled trie
|
||||
```text
|
||||
Dictionary -> Mutable trie -> Reduced trie -> Compiled trie
|
||||
```
|
||||
|
||||
Each stage has a distinct purpose:
|
||||
At runtime, the compiled trie does not directly return the final stem string. It returns one or more stored patch commands for the addressed key, and those commands are then applied to the original input word.
|
||||
|
||||
| Stage | Purpose | Structure |
|
||||
|------------|----------------------------------|-------------------------|
|
||||
| Build | Collect mappings | `MutableNode` |
|
||||
| Reduction | Merge equivalent subtrees | `ReducedNode` |
|
||||
| Compilation | Optimize for runtime lookup | `CompiledNode` |
|
||||
## Why this matters
|
||||
|
||||
This design gives Radixor several practical properties at once:
|
||||
|
||||
- compact deployable artifacts,
|
||||
- deterministic runtime behavior,
|
||||
- support for both preferred and multiple candidate results,
|
||||
- separation of preparation-time complexity from runtime lookup.
|
||||
|
||||
## Core data model
|
||||
It also explains why a large source dictionary can be transformed into a much smaller compiled artifact without discarding the operational behavior that matters to the caller.
|
||||
|
||||
### Patch-command trie
|
||||
## Reading guide
|
||||
|
||||
Radixor stores **patch commands** instead of stems directly.
|
||||
Use the following pages depending on what you need to understand:
|
||||
|
||||
- keys: word forms
|
||||
- values: transformation commands
|
||||
- structure: trie (prefix tree)
|
||||
- [Architecture](architecture.md) explains the data flow, core structures, patch-command lookup model, and why the compiled trie is efficient at runtime.
|
||||
- [Reduction Semantics](reduction-semantics.md) explains how subtree equivalence is defined, what ranked, unordered, and dominant reduction preserve, and how those choices affect observable lookup behavior.
|
||||
|
||||
At runtime:
|
||||
## Recommended reading order
|
||||
|
||||
1. the word is traversed through the trie
|
||||
2. a patch command is retrieved
|
||||
3. the patch is applied to reconstruct the stem
|
||||
For most readers, the best order is:
|
||||
|
||||
1. [Architecture](architecture.md)
|
||||
2. [Reduction Semantics](reduction-semantics.md)
|
||||
|
||||
## Related documentation
|
||||
|
||||
## Stage 1: Mutable construction
|
||||
|
||||
The builder (`FrequencyTrie.Builder`) constructs a trie using:
|
||||
|
||||
- `MutableNode`
|
||||
- maps of children (`char → node`)
|
||||
- maps of value counts (`value → frequency`)
|
||||
|
||||
Characteristics:
|
||||
|
||||
- insertion-order preserving
|
||||
- mutable
|
||||
- optimized for building, not querying
|
||||
|
||||
Example structure:
|
||||
|
||||
```
|
||||
g
|
||||
└─ n
|
||||
└─ i
|
||||
└─ n
|
||||
└─ n
|
||||
└─ u
|
||||
└─ r
|
||||
└─ (values: {
|
||||
"<patch-command-1>": 3,
|
||||
"<patch-command-2>": 1
|
||||
})
|
||||
```
|
||||
|
||||
This example represents the word "running", stored in reversed form.
|
||||
|
||||
- each edge corresponds to one character of the word
|
||||
- the path is traversed from the end of the word toward the beginning
|
||||
- the terminal node stores one or more patch commands together with their local frequencies
|
||||
|
||||
The values represent transformations from the word form to candidate stems, and the counts indicate how often each mapping was observed during construction.
|
||||
|
||||
Note: Radixor stores word forms in reversed order so that suffix-based transformations can be matched efficiently in a trie.
|
||||
|
||||
|
||||
## Local value summary
|
||||
|
||||
Before reduction, each node is summarized using `LocalValueSummary`.
|
||||
|
||||
It computes:
|
||||
|
||||
- ordered values (by frequency)
|
||||
- aligned counts
|
||||
- total frequency
|
||||
- dominant value (if any)
|
||||
- second-best value
|
||||
|
||||
This summary is critical for:
|
||||
|
||||
- deterministic ordering
|
||||
- reduction decisions
|
||||
- dominance evaluation
|
||||
|
||||
|
||||
|
||||
## Stage 2: Reduction (canonicalization)
|
||||
|
||||
Reduction is the process of merging **semantically equivalent subtrees**.
|
||||
|
||||
### Why reduction exists
|
||||
|
||||
Without reduction:
|
||||
|
||||
- trie size grows linearly with input data
|
||||
- repeated patterns are duplicated
|
||||
|
||||
With reduction:
|
||||
|
||||
- identical subtrees are shared
|
||||
- memory footprint is reduced
|
||||
- binary output becomes smaller
|
||||
|
||||
|
||||
|
||||
## Reduction signature
|
||||
|
||||
Each subtree is represented by a **ReductionSignature**.
|
||||
|
||||
A signature consists of:
|
||||
|
||||
1. **local descriptor** (node semantics)
|
||||
2. **child descriptors** (structure)
|
||||
|
||||
```
|
||||
Signature = (LocalDescriptor, SortedChildDescriptors)
|
||||
```
|
||||
|
||||
Two subtrees are merged if their signatures are equal.
|
||||
|
||||
|
||||
|
||||
## Local descriptors
|
||||
|
||||
The local descriptor encodes how values at a node are interpreted.
|
||||
|
||||
Radixor supports three descriptor types:
|
||||
|
||||
### 1. Ranked descriptor
|
||||
|
||||
Preserves:
|
||||
|
||||
- full ordering of values (`getAll()`)
|
||||
|
||||
Uses:
|
||||
|
||||
- ordered value list
|
||||
|
||||
Best for:
|
||||
|
||||
- correctness
|
||||
- deterministic multi-result behavior
|
||||
|
||||
|
||||
|
||||
### 2. Unordered descriptor
|
||||
|
||||
Preserves:
|
||||
|
||||
- only membership (set of values)
|
||||
|
||||
Ignores:
|
||||
|
||||
- ordering differences
|
||||
|
||||
Best for:
|
||||
|
||||
- higher compression
|
||||
- use cases where ordering is irrelevant
|
||||
|
||||
|
||||
|
||||
### 3. Dominant descriptor
|
||||
|
||||
Preserves:
|
||||
|
||||
- only the dominant value (`get()`)
|
||||
|
||||
Condition:
|
||||
|
||||
- dominant value must satisfy thresholds:
|
||||
- minimum percentage
|
||||
- ratio over second-best
|
||||
|
||||
Fallback:
|
||||
|
||||
- if dominance is not strong enough → ranked descriptor is used
|
||||
|
||||
Best for:
|
||||
|
||||
- maximum compression
|
||||
- single-result workflows
|
||||
|
||||
|
||||
|
||||
## Child descriptors
|
||||
|
||||
Each child is represented as:
|
||||
|
||||
```
|
||||
(edge character, child signature)
|
||||
```
|
||||
|
||||
Children are sorted by edge character to ensure:
|
||||
|
||||
- deterministic signatures
|
||||
- stable equality comparisons
|
||||
|
||||
|
||||
|
||||
## Reduction context
|
||||
|
||||
`ReductionContext` maintains:
|
||||
|
||||
- mapping: `ReductionSignature → ReducedNode`
|
||||
- canonical instances of subtrees
|
||||
|
||||
Workflow:
|
||||
|
||||
1. compute signature
|
||||
2. check if already exists
|
||||
3. reuse existing node or create new one
|
||||
|
||||
This ensures:
|
||||
|
||||
- structural sharing
|
||||
- no duplicate equivalent subtrees
|
||||
|
||||
|
||||
|
||||
## Reduced nodes
|
||||
|
||||
`ReducedNode` represents:
|
||||
|
||||
- canonical subtree
|
||||
- aggregated value counts
|
||||
- canonical children
|
||||
|
||||
It supports:
|
||||
|
||||
- merging local counts
|
||||
- verifying structural consistency
|
||||
|
||||
At this stage:
|
||||
|
||||
- structure is canonical
|
||||
- still mutable (internally)
|
||||
|
||||
|
||||
|
||||
## Stage 3: Compilation (freezing)
|
||||
|
||||
The reduced trie is converted into a **CompiledNode** structure.
|
||||
|
||||
### CompiledNode characteristics
|
||||
|
||||
- immutable
|
||||
- array-based storage
|
||||
- optimized for fast lookup
|
||||
|
||||
Fields:
|
||||
|
||||
- `char[] edgeLabels`
|
||||
- `CompiledNode[] children`
|
||||
- `V[] orderedValues`
|
||||
- `int[] orderedCounts`
|
||||
|
||||
|
||||
|
||||
## Lookup algorithm
|
||||
|
||||
Runtime lookup:
|
||||
|
||||
1. traverse trie using `edgeLabels` (matching characters from the end of the word toward the beginning)
|
||||
2. binary search per node
|
||||
3. retrieve values
|
||||
4. apply patch command
|
||||
|
||||
Properties:
|
||||
|
||||
- O(length of word)
|
||||
- low memory overhead
|
||||
- minimal memory allocation during lookup; patch application produces the resulting string
|
||||
|
||||
|
||||
## Deterministic ordering
|
||||
|
||||
Value ordering is deterministic and stable:
|
||||
|
||||
1. higher frequency first
|
||||
2. shorter string first
|
||||
3. lexicographically smaller
|
||||
4. insertion order
|
||||
|
||||
This guarantees:
|
||||
|
||||
- reproducible builds
|
||||
- stable query results
|
||||
- predictable ranking
|
||||
|
||||
|
||||
|
||||
## Reduction modes
|
||||
|
||||
Reduction modes control how local descriptors are chosen.
|
||||
|
||||
### Ranked mode
|
||||
|
||||
```
|
||||
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
```
|
||||
|
||||
- preserves full semantics
|
||||
- safest option
|
||||
- recommended default
|
||||
|
||||
|
||||
|
||||
### Unordered mode
|
||||
|
||||
```
|
||||
MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
|
||||
```
|
||||
|
||||
- ignores ordering
|
||||
- higher compression
|
||||
- slightly weaker semantics
|
||||
|
||||
|
||||
|
||||
### Dominant mode
|
||||
|
||||
```
|
||||
MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
|
||||
```
|
||||
|
||||
- keeps only dominant result
|
||||
- highest compression
|
||||
- may lose alternative candidates
|
||||
|
||||
|
||||
|
||||
## Trade-offs
|
||||
|
||||
| Aspect | Ranked | Unordered | Dominant |
|
||||
|---------------|--------|----------|----------|
|
||||
| Compression | Medium | High | Highest |
|
||||
| Accuracy | High | Medium | Lower |
|
||||
| getAll() | Full | Partial | Limited |
|
||||
| get() | Exact | Exact | Heuristic|
|
||||
|
||||
|
||||
|
||||
## Deserialization model
|
||||
|
||||
Binary loading uses:
|
||||
|
||||
- `NodeData` as intermediate representation
|
||||
- reconstruction of `CompiledNode`
|
||||
|
||||
This separates:
|
||||
|
||||
- I/O format
|
||||
- in-memory structure
|
||||
|
||||
|
||||
|
||||
## Why this architecture works
|
||||
|
||||
Radixor achieves:
|
||||
|
||||
### Compactness
|
||||
|
||||
- subtree sharing
|
||||
- efficient encoding
|
||||
- compressed binary output
|
||||
|
||||
### Performance
|
||||
|
||||
- array-based lookup
|
||||
- no runtime reduction
|
||||
- minimal branching
|
||||
|
||||
### Flexibility
|
||||
|
||||
- configurable reduction strategies
|
||||
- multiple result support
|
||||
- dictionary-driven behavior
|
||||
|
||||
### Determinism
|
||||
|
||||
- stable ordering
|
||||
- canonical signatures
|
||||
- reproducible builds
|
||||
|
||||
|
||||
|
||||
## Design philosophy
|
||||
|
||||
The architecture reflects a few key principles:
|
||||
|
||||
- separate build-time complexity from runtime simplicity
|
||||
- encode semantics explicitly (not implicitly in code)
|
||||
- favor deterministic behavior over heuristic shortcuts
|
||||
- allow controlled trade-offs between size and fidelity
|
||||
|
||||
|
||||
|
||||
## When to tune reduction
|
||||
|
||||
You should consider changing reduction mode when:
|
||||
|
||||
- binary size is too large
|
||||
- memory footprint must be minimized
|
||||
- only single-result stemming is needed
|
||||
|
||||
Otherwise:
|
||||
|
||||
**use ranked mode by default**
|
||||
|
||||
|
||||
|
||||
## Next steps
|
||||
|
||||
- [Quick start](quick-start.md)
|
||||
- [Programmatic usage](programmatic-usage.md)
|
||||
- [CLI compilation](cli-compilation.md)
|
||||
- [Dictionary format](dictionary-format.md)
|
||||
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Radixor’s architecture is built around:
|
||||
|
||||
- patch-command tries
|
||||
- canonical subtree reduction
|
||||
- immutable compiled structures
|
||||
|
||||
This design allows the system to remain:
|
||||
|
||||
- fast
|
||||
- compact
|
||||
- deterministic
|
||||
- adaptable
|
||||
|
||||
while still supporting advanced use cases such as:
|
||||
|
||||
- ambiguity-aware stemming
|
||||
- dictionary evolution
|
||||
- controlled trade-offs between size and behavior
|
||||
|
||||
Reference in New Issue
Block a user