Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words
fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs
refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure
test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics
docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance
chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00
parent 15248c92c9
commit 038514bad0
64 changed files with 190190 additions and 20 deletions
--- a/docs/architecture-and-reduction.md
+++ b/docs/architecture-and-reduction.md
@@ -0,0 +1,470 @@
+# Architecture and Reduction
+
+> ← Back to [README.md](../README.md)
+
+This document describes the internal architecture of **Radixor** and the principles behind its **trie compilation and reduction model**.
+
+It explains:
+
+- how data flows from dictionary input to compiled trie
+- how patch-command tries are structured
+- how subtree reduction works
+- how reduction modes affect behavior and size
+
+
+
+## Overview
+
+Radixor transforms dictionary data into an optimized runtime structure through three stages:
+
+1. **Mutable construction**
+2. **Reduction (canonicalization)**
+3. **Compilation (freezing)**
+
+```
+Dictionary → Mutable trie → Reduced trie → Compiled trie
+```
+
+Each stage has a distinct purpose:
+
+| Stage       | Purpose                         | Structure               |
+|------------|----------------------------------|-------------------------|
+| Build       | Collect mappings                 | `MutableNode`           |
+| Reduction   | Merge equivalent subtrees        | `ReducedNode`           |
+| Compilation | Optimize for runtime lookup      | `CompiledNode`          |
+
+
+
+## Core data model
+
+### Patch-command trie
+
+Radixor stores **patch commands** instead of stems directly.
+
+- keys: word forms
+- values: transformation commands
+- structure: trie (prefix tree)
+
+At runtime:
+
+1. the word is traversed through the trie
+2. a patch command is retrieved
+3. the patch is applied to reconstruct the stem
+
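The retrieve-then-apply idea can be sketched with a toy patch format. The format used here (`<n>:<suffix>`, read as "drop the last n characters, then append the suffix") and the class name `SimplePatch` are assumptions made for illustration only; Radixor's real patch commands are produced by `PatchCommandEncoder`, and their encoding is not described in this document.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: a toy patch format, not Radixor's actual encoding. */
final class SimplePatch {

    private SimplePatch() {
    }

    /** Applies a patch of the hypothetical form "<dropCount>:<suffixToAppend>". */
    static String apply(String word, String patch) {
        int separator = patch.indexOf(':');
        if (separator < 0) {
            throw new IllegalArgumentException("patch has no ':' separator: " + patch);
        }
        int dropCount = Integer.parseInt(patch.substring(0, separator));
        String suffix = patch.substring(separator + 1);
        if (dropCount > word.length()) {
            throw new IllegalArgumentException("patch drops more characters than the word has");
        }
        return word.substring(0, word.length() - dropCount) + suffix;
    }

    public static void main(String[] args) {
        // A plain sorted map stands in for the trie lookup in this sketch.
        Map<String, String> patches = new TreeMap<>();
        patches.put("running", "4:");
        patches.put("stopping", "4:");
        patches.put("analyses", "3:ze");
        patches.put("run", "0:");
        patches.forEach((word, patch) ->
                System.out.println(word + " -> " + apply(word, patch)));
    }
}
```

In this toy format the same patch (`4:`) serves both `running` and `stopping`, which hints at why storing patches rather than stems leaves many nodes with identical values.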
+
+
+## Stage 1: Mutable construction
+
+The builder (`FrequencyTrie.Builder`) constructs a trie using:
+
+- `MutableNode`
+- maps of children (`char → node`)
+- maps of value counts (`value → frequency`)
+
+Characteristics:
+
+- insertion-order preserving
+- mutable
+- optimized for building, not querying
+
+Example structure:
+
+```
+g
+ └─ n
+     └─ i
+         └─ n
+             └─ n
+                 └─ u
+                     └─ r
+                         └─ (values: {
+                               "<patch-command-1>": 3,
+                               "<patch-command-2>": 1
+                           })
+```
+
+This example represents the word "running", stored in reversed form.
+
+- each edge corresponds to one character of the word
+- the path is traversed from the end of the word toward the beginning
+- the terminal node stores one or more patch commands together with their local frequencies
+
+The values represent transformations from the word form to candidate stems, and the counts indicate how often each mapping was observed during construction.
+
+Note: Radixor stores word forms in reversed order so that suffix-based transformations can be matched efficiently in a trie.
+
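A minimal sketch of the construction stage, assuming a simplified node with insertion-ordered maps. The names `ReversedTrieSketch`, `Node`, `put`, and `counts` are hypothetical and do not mirror the `FrequencyTrie.Builder` API; the point is only the reversed traversal and the per-node frequency counting.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative mutable trie keyed by reversed words; all names are hypothetical. */
final class ReversedTrieSketch {

    /** One mutable node: child edges and value frequencies, both in insertion order. */
    static final class Node {
        final Map<Character, Node> children = new LinkedHashMap<>();
        final Map<String, Integer> valueCounts = new LinkedHashMap<>();
    }

    private final Node root = new Node();

    /** Walks the word from its last character to its first, then counts the value. */
    void put(String word, String value) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.valueCounts.merge(value, 1, Integer::sum);
    }

    /** Returns the counts stored for the word, or an empty map when the path is absent. */
    Map<String, Integer> counts(String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0 && node != null; i--) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? Map.of() : node.valueCounts;
    }
}
```

Inserting `"running"` three times with one patch and once with another reproduces the `{…: 3, …: 1}` shape shown in the diagram above.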
+
+## Local value summary
+
+Before reduction, each node is summarized using `LocalValueSummary`.
+
+It computes:
+
+- ordered values (by frequency)
+- aligned counts
+- total frequency
+- dominant value (if any)
+- second-best value
+
+This summary is critical for:
+
+- deterministic ordering
+- reduction decisions
+- dominance evaluation
+
+
+
+## Stage 2: Reduction (canonicalization)
+
+Reduction is the process of merging **semantically equivalent subtrees**.
+
+### Why reduction exists
+
+Without reduction:
+
+- trie size grows linearly with input data
+- repeated patterns are duplicated
+
+With reduction:
+
+- identical subtrees are shared
+- memory footprint is reduced
+- binary output becomes smaller
+
+
+
+## Reduction signature
+
+Each subtree is represented by a **ReductionSignature**.
+
+A signature consists of:
+
+1. **local descriptor** (node semantics)
+2. **child descriptors** (structure)
+
+```
+Signature = (LocalDescriptor, SortedChildDescriptors)
+```
+
+Two subtrees are merged if their signatures are equal.
+
+
+
+## Local descriptors
+
+The local descriptor encodes how values at a node are interpreted.
+
+Radixor supports three descriptor types:
+
+### 1. Ranked descriptor
+
+Preserves:
+
+- full ordering of values (`getAll()`)
+
+Uses:
+
+- ordered value list
+
+Best for:
+
+- correctness
+- deterministic multi-result behavior
+
+
+
+### 2. Unordered descriptor
+
+Preserves:
+
+- only membership (set of values)
+
+Ignores:
+
+- ordering differences
+
+Best for:
+
+- higher compression
+- use cases where ordering is irrelevant
+
+
+
+### 3. Dominant descriptor
+
+Preserves:
+
+- only the dominant value (`get()`)
+
+Condition:
+
+- dominant value must satisfy thresholds:
+  - minimum percentage
+  - ratio over second-best
+
+Fallback:
+
+- if dominance is not strong enough → ranked descriptor is used
+
+Best for:
+
+- maximum compression
+- single-result workflows
+
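The two dominance thresholds can be illustrated with a small check. This is a sketch under assumptions: the method name, the inclusive comparisons, and the threshold values used in the usage note are hypothetical and are not taken from Radixor's `ReductionSettings`.

```java
/** Illustrative dominance test; thresholds and comparison rules are assumptions. */
final class DominanceSketch {

    private DominanceSketch() {
    }

    /**
     * Decides whether the first count dominates the others.
     *
     * @param counts     frequencies sorted in descending order
     * @param minPercent minimum share of the total the winner must hold (0 to 100)
     * @param minRatio   minimum ratio of the winner over the second-best count
     */
    static boolean isDominant(int[] counts, int minPercent, double minRatio) {
        if (counts.length == 0) {
            return false;
        }
        long total = 0;
        for (int count : counts) {
            total += count;
        }
        int best = counts[0];
        // Share test, done in integers to avoid rounding: best / total >= minPercent / 100.
        boolean shareOk = best * 100L >= (long) minPercent * total;
        // Ratio test only applies when a second-best value exists.
        boolean ratioOk = counts.length == 1 || best >= minRatio * counts[1];
        return shareOk && ratioOk;
    }
}
```

With a hypothetical 75 % share and a 3.0 ratio, counts `{9, 1}` qualify as dominant, while `{5, 4}` do not and would fall back to the ranked descriptor.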
+
+
+## Child descriptors
+
+Each child is represented as:
+
+```
+(edge character, child signature)
+```
+
+Children are sorted by edge character to ensure:
+
+- deterministic signatures
+- stable equality comparisons
+
+
+
+## Reduction context
+
+`ReductionContext` maintains:
+
+- mapping: `ReductionSignature → ReducedNode`
+- canonical instances of subtrees
+
+Workflow:
+
+1. compute signature
+2. check if already exists
+3. reuse existing node or create new one
+
+This ensures:
+
+- structural sharing
+- no duplicate equivalent subtrees
+
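The compute-check-reuse workflow is a form of hash-consing. The sketch below assumes Java 16 or later (for records) and collapses signature and node into one hypothetical `Node` record; the real `ReductionSignature`, `ReductionContext`, and `ReducedNode` types are separate classes whose APIs are not shown here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative canonicalizer: equal subtrees are represented by one shared instance. */
final class CanonicalizerSketch {

    /** Ranked values plus children; record equality makes this usable as a signature. */
    record Node(List<String> values, Map<Character, Node> children) {
    }

    private final Map<Node, Node> canonical = new HashMap<>();

    /** Returns the canonical instance for the given local values and canonical children. */
    Node intern(List<String> values, Map<Character, Node> children) {
        // Children are copied into a sorted map so iteration order is by edge character.
        Node candidate = new Node(
                List.copyOf(values),
                Collections.unmodifiableMap(new TreeMap<>(children)));
        return canonical.computeIfAbsent(candidate, key -> key);
    }

    /** Number of distinct subtrees seen so far. */
    int distinctSubtrees() {
        return canonical.size();
    }
}
```

Because the value list is compared in order, this sketch behaves like the ranked descriptor. Sorting the values before calling `intern` would approximate the unordered descriptor instead.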
+
+
+## Reduced nodes
+
+`ReducedNode` represents:
+
+- canonical subtree
+- aggregated value counts
+- canonical children
+
+It supports:
+
+- merging local counts
+- verifying structural consistency
+
+At this stage:
+
+- structure is canonical
+- still mutable (internally)
+
+
+
+## Stage 3: Compilation (freezing)
+
+The reduced trie is converted into a **CompiledNode** structure.
+
+### CompiledNode characteristics
+
+- immutable
+- array-based storage
+- optimized for fast lookup
+
+Fields:
+
+- `char[] edgeLabels`
+- `CompiledNode[] children`
+- `V[] orderedValues`
+- `int[] orderedCounts`
+
+
+
+## Lookup algorithm
+
+Runtime lookup:
+
+1. traverse trie using `edgeLabels` (matching characters from the end of the word toward the beginning)
+2. binary search per node
+3. retrieve values
+4. apply patch command
+
+Properties:
+
+- O(length of word) node visits, each resolved by a binary search over that node's edge labels
+- low memory overhead
+- minimal memory allocation during lookup; patch application produces the resulting string
+
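The traversal steps can be sketched with JDK arrays alone. The field names follow the list in the compilation section, but the class `CompiledLookupSketch`, its helper methods, and the choice to return an empty array on a missing edge are assumptions for illustration, not the behavior of the real `CompiledNode`.

```java
import java.util.Arrays;

/** Illustrative array-based lookup over reversed words; not the real CompiledNode. */
final class CompiledLookupSketch {

    /** Immutable node; edgeLabels must be sorted and aligned with children. */
    static final class Node {
        final char[] edgeLabels;
        final Node[] children;
        final String[] orderedValues;

        Node(char[] edgeLabels, Node[] children, String[] orderedValues) {
            this.edgeLabels = edgeLabels.clone();
            this.children = children.clone();
            this.orderedValues = orderedValues.clone();
        }
    }

    private CompiledLookupSketch() {
    }

    /** Creates a node without children that stores the given values. */
    static Node leaf(String... values) {
        return new Node(new char[0], new Node[0], values);
    }

    /** Creates a node without values; labels must be sorted and match children by index. */
    static Node branch(char[] sortedLabels, Node... children) {
        return new Node(sortedLabels, children, new String[0]);
    }

    /** Walks the word from its last character; returns an empty array when an edge is missing. */
    static String[] getAll(Node root, String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            int index = Arrays.binarySearch(node.edgeLabels, word.charAt(i));
            if (index < 0) {
                return new String[0];
            }
            node = node.children[index];
        }
        return node.orderedValues.clone();
    }
}
```

Each character costs one binary search over a small sorted array, so no per-node maps or iterators are allocated during the walk.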
+
+## Deterministic ordering
+
+Value ordering is deterministic and stable:
+
+1. higher frequency first
+2. shorter string first
+3. lexicographically smaller
+4. insertion order
+
+This guarantees:
+
+- reproducible builds
+- stable query results
+- predictable ranking
+
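The four ordering rules translate directly into a comparator. The sketch below is an assumption-labeled illustration (the class and method names are hypothetical); it relies on the fact that `List.sort` is stable, so elements that compare as equal keep their insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative ranking of values by the documented tie-breaking rules. */
final class ValueOrderingSketch {

    private ValueOrderingSketch() {
    }

    /** Ranks the keys of an insertion-ordered map of value to frequency. */
    static List<String> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Rule 1: higher frequency first.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            if (byCount != 0) {
                return byCount;
            }
            // Rule 2: shorter string first.
            int byLength = Integer.compare(a.getKey().length(), b.getKey().length());
            if (byLength != 0) {
                return byLength;
            }
            // Rule 3: lexicographically smaller first.
            return a.getKey().compareTo(b.getKey());
            // Rule 4 (insertion order) follows from the stable sort.
        });
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }
}
```

For distinct string values the third rule already yields a total order, so the insertion-order rule only matters for value types whose textual forms can coincide.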
+
+
+## Reduction modes
+
+Reduction modes control how local descriptors are chosen.
+
+### Ranked mode
+
+```
+MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+```
+
+- preserves full semantics
+- safest option
+- recommended default
+
+
+
+### Unordered mode
+
+```
+MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
+```
+
+- ignores ordering
+- higher compression
+- slightly weaker semantics
+
+
+
+### Dominant mode
+
+```
+MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
+```
+
+- keeps only dominant result
+- highest compression
+- may lose alternative candidates
+
+
+
+## Trade-offs
+
+| Aspect        | Ranked | Unordered | Dominant |
+|---------------|--------|----------|----------|
+| Compression   | Medium | High     | Highest  |
+| Accuracy      | High   | Medium   | Lower    |
+| getAll()      | Full   | Partial  | Limited  |
+| get()         | Exact  | Exact    | Heuristic|
+
+
+
+## Deserialization model
+
+Binary loading uses:
+
+- `NodeData` as intermediate representation
+- reconstruction of `CompiledNode`
+
+This separates:
+
+- I/O format
+- in-memory structure
+
+
+
+## Why this architecture works
+
+Radixor achieves:
+
+### Compactness
+
+- subtree sharing
+- efficient encoding
+- compressed binary output
+
+### Performance
+
+- array-based lookup
+- no runtime reduction
+- minimal branching
+
+### Flexibility
+
+- configurable reduction strategies
+- multiple result support
+- dictionary-driven behavior
+
+### Determinism
+
+- stable ordering
+- canonical signatures
+- reproducible builds
+
+
+
+## Design philosophy
+
+The architecture reflects a few key principles:
+
+- separate build-time complexity from runtime simplicity
+- encode semantics explicitly (not implicitly in code)
+- favor deterministic behavior over heuristic shortcuts
+- allow controlled trade-offs between size and fidelity
+
+
+
+## When to tune reduction
+
+You should consider changing reduction mode when:
+
+- binary size is too large
+- memory footprint must be minimized
+- only single-result stemming is needed
+
+Otherwise:
+
+**use ranked mode by default**
+
+
+
+## Next steps
+
+- [Programmatic usage](programmatic-usage.md)
+- [CLI compilation](cli-compilation.md)
+- [Dictionary format](dictionary-format.md)
+
+
+
+## Summary
+
+Radixor’s architecture is built around:
+
+- patch-command tries
+- canonical subtree reduction
+- immutable compiled structures
+
+This design allows the system to remain:
+
+- fast
+- compact
+- deterministic
+- adaptable
+
+while still supporting advanced use cases such as:
+
+- ambiguity-aware stemming
+- dictionary evolution
+- controlled trade-offs between size and behavior
--- a/docs/built-in-languages.md
+++ b/docs/built-in-languages.md
@@ -0,0 +1,252 @@
+# Built-in Languages
+
+> ← Back to [README.md](../README.md)
+
+Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
+
+These built-in resources are useful for:
+
+- quick integration
+- testing and evaluation
+- reference behavior
+- prototyping search pipelines
+
+
+
+## Overview
+
+Bundled dictionaries are exposed through:
+
+```java
+StemmerPatchTrieLoader.Language
+```
+
+They are packaged with the library and loaded from the classpath.
+
+
+
+## Supported languages
+
+The following language identifiers are currently available:
+
+| Language | Enum constant     | Description                  |
+|----------|------------------|------------------------------|
+| Danish   | `DA_DK`          | Danish                       |
+| German   | `DE_DE`          | German                       |
+| Spanish  | `ES_ES`          | Spanish                      |
+| French   | `FR_FR`          | French                       |
+| Italian  | `IT_IT`          | Italian                      |
+| Dutch    | `NL_NL`          | Dutch                        |
+| Norwegian| `NO_NO`          | Norwegian                    |
+| Portuguese| `PT_PT`         | Portuguese                   |
+| Russian  | `RU_RU`          | Russian                      |
+| Swedish  | `SV_SE`          | Swedish                      |
+| English  | `US_UK`          | Standard English             |
+| English  | `US_UK_PROFI`    | Extended English dictionary  |
+
+
+
+## Basic usage
+
+Load a bundled stemmer:
+
+```java
+import java.io.IOException;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class BuiltInExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                true,
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+        );
+    }
+}
+```
+
+
+
+## Example: stemming with `US_UK_PROFI`
+
+```java
+import java.io.IOException;
+
+import org.egothor.stemmer.*;
+
+public final class EnglishExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                true,
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+        );
+
+        String word = "running";
+        String patch = trie.get(word);
+        String stem = PatchCommandEncoder.apply(word, patch);
+
+        System.out.println(word + " -> " + stem);
+    }
+}
+```
+
+
+
+## `US_UK` vs `US_UK_PROFI`
+
+### `US_UK`
+
+* smaller dictionary
+* faster load time
+* suitable for lightweight use cases
+
+### `US_UK_PROFI`
+
+* larger and more complete dataset
+* better coverage of word forms
+* improved stemming quality
+* slightly larger memory footprint
+
+### Recommendation
+
+Use:
+
+```
+US_UK_PROFI
+```
+
+for most applications unless memory constraints are strict.
+
+
+
+## How bundled dictionaries are loaded
+
+Internally:
+
+- dictionaries are stored as text resources
+- parsed using `StemmerDictionaryParser`
+- compiled into a trie at load time
+
+This means:
+
+- first load includes parsing + compilation cost
+- subsequent usage is fast
+
+
+
+## When to use bundled languages
+
+Bundled dictionaries are suitable when:
+
+- you need quick results without preparing custom data
+- you are prototyping or experimenting
+- your language requirements match the provided datasets
+
+
+
+## When to use custom dictionaries
+
+You should prefer custom dictionaries when:
+
+- domain-specific vocabulary is important
+- accuracy requirements are high
+- you need full control over stemming behavior
+
+Typical examples:
+
+- technical terminology
+- product catalogs
+- biomedical text
+- legal or financial language
+
+
+
+## Production recommendation
+
+For production systems:
+
+1. Load a bundled dictionary
+2. Extend it with domain-specific terms (optional)
+3. Compile it into a binary `.radixor.gz` file
+4. Deploy the compiled artifact
+5. Load it using `loadBinary(...)`
+
+This avoids:
+
+- runtime parsing overhead
+- repeated compilation
+- startup latency
+
+
+
+## Example workflow
+
+```java
+// 1. Load bundled dictionary
+FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
+        StemmerPatchTrieLoader.Language.US_UK_PROFI,
+        true,
+        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+);
+
+// 2. Modify (optional)
+FrequencyTrie.Builder<String> builder =
+        FrequencyTrieBuilders.copyOf(
+                base,
+                String[]::new,
+                ReductionSettings.withDefaults(
+                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+                )
+        );
+
+builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
+
+// 3. Compile
+FrequencyTrie<String> compiled = builder.build();
+
+// 4. Save
+StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
+```
+
+
+
+## Limitations
+
+* bundled dictionaries are **general-purpose**
+* they may not reflect:
+
+  * domain-specific usage
+  * rare or specialized vocabulary
+  * organization-specific terminology
+
+
+
+## Next steps
+
+* [Quick start](quick-start.md)
+* [Dictionary format](dictionary-format.md)
+* [CLI compilation](cli-compilation.md)
+* [Programmatic usage](programmatic-usage.md)
+
+
+
+## Summary
+
+Radixor’s built-in language support provides:
+
+* immediate usability
+* reference datasets
+* a starting point for customization
+
+For production systems, they are best used as:
+
+* a baseline
+* a seed for further extension
+* a source for compiled deployment artifacts
+
--- a/docs/cli-compilation.md
+++ b/docs/cli-compilation.md
@@ -0,0 +1,305 @@
+# CLI Compilation
+
+> ← Back to [README.md](../README.md)
+
+Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.
+
+This is the recommended workflow for deployment environments, as it separates:
+
+- dictionary preparation (offline)
+- stemming execution (runtime)
+
+
+
+## Overview
+
+The `Compile` tool:
+
+1. reads a line-oriented dictionary file
+2. converts word–stem pairs into patch commands
+3. builds a trie structure
+4. applies subtree reduction
+5. writes a compressed binary artifact
+
+The output is a `.radixor.gz` file suitable for fast runtime loading.
+
+
+
+## Basic usage
+
+```bash
+java org.egothor.stemmer.Compile \
+  --input ./data/stemmer.txt \
+  --output ./build/english.radixor.gz \
+  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
+  --store-original \
+  --overwrite
+```
+
+
+
+## Required arguments
+
+### `--input`
+
+Path to the source dictionary file.
+
+* must be in the [dictionary format](dictionary-format.md)
+* must be readable
+* UTF-8 encoding is expected
+
+```
+--input ./data/stemmer.txt
+```
+
+### `--output`
+
+Path to the output binary file.
+
+* parent directories are created automatically
+* output is written as **GZip-compressed binary**
+
+```
+--output ./build/english.radixor.gz
+```
+
+
+
+## Optional arguments
+
+### `--reduction-mode`
+
+Controls how aggressively the trie is reduced during compilation.
+
+Available values:
+
+* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
+* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
+* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
+
+Example:
+
+```
+--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+```
+
+#### Recommendation
+
+Use:
+
+```
+MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+```
+
+This provides:
+
+* safe behavior
+* deterministic ordering
+* good compression
+
+
+
+### `--store-original`
+
+Stores the stem itself as a no-op mapping.
+
+```
+--store-original
+```
+
+Effect:
+
+* ensures that canonical forms are always resolvable
+* improves robustness in real-world inputs
+
+Recommended for most use cases.
+
+
+
+### `--overwrite`
+
+Allows overwriting an existing output file.
+
+```
+--overwrite
+```
+
+Without this flag:
+
+* compilation fails if the output file already exists
+
+
+
+## Reduction strategy explained
+
+Reduction merges semantically equivalent subtrees to reduce memory and file size.
+
+Trade-offs:
+
+| Mode      | Compression | Behavioral fidelity |
+| --------- | ----------- | ------------------- |
+| Ranked    | Medium      | High                |
+| Unordered | High        | Medium              |
+| Dominant  | Highest     | Lower (heuristic)   |
+
+### Ranked (recommended)
+
+* preserves full `getAll()` ordering
+* safest and most predictable
+
+### Unordered
+
+* ignores ordering differences
+* higher compression, but less precise semantics
+
+### Dominant
+
+* focuses on the most frequent result
+* useful when only `get()` is relevant
+* may lose secondary candidates
+
+
+
+## Output format
+
+The compiled file:
+
+* is a binary representation of the trie
+* uses **GZip compression**
+* is optimized for:
+
+  * fast loading
+  * minimal memory footprint
+
+Typical properties:
+
+* small file size
+* fast deserialization
+* no runtime preprocessing required
+
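The documented output behavior (parent directories created automatically, GZip compression, refusal to replace an existing file unless `--overwrite` is given) can be approximated with JDK classes. This is a sketch under assumptions: the payload is arbitrary bytes rather than Radixor's binary layout, and `GzipArtifactSketch` and its methods are hypothetical names, not part of the `Compile` tool.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative GZip artifact writer with an overwrite guard. */
final class GzipArtifactSketch {

    private GzipArtifactSketch() {
    }

    /** Writes the payload GZip-compressed; fails on an existing file unless overwrite is set. */
    static void write(Path output, byte[] payload, boolean overwrite) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        OpenOption[] options = overwrite
                ? new OpenOption[] {StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE}
                : new OpenOption[] {StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE};
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(output, options))) {
            out.write(payload);
        }
    }

    /** Reads and decompresses the whole file. */
    static byte[] read(Path input) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(input))) {
            return in.readAllBytes();
        }
    }

    /** Round-trips a payload in a temporary directory and checks the overwrite rule. */
    static boolean demo() {
        try {
            Path file = Files.createTempDirectory("radixor-sketch").resolve("build/demo.bin.gz");
            byte[] payload = "example payload".getBytes(StandardCharsets.UTF_8);
            write(file, payload, false);
            boolean rejected = false;
            try {
                write(file, payload, false);
            } catch (FileAlreadyExistsException expected) {
                rejected = true;
            }
            write(file, payload, true);
            return rejected && Arrays.equals(payload, read(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Using `CREATE_NEW` makes the existence check and the file creation a single atomic step, which avoids a race between checking for the file and opening it.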
+
+
+## Example workflow
+
+### 1. Prepare dictionary
+
+```
+run running runs ran
+connect connected connecting
+```
+
+### 2. Compile
+
+```bash
+java org.egothor.stemmer.Compile \
+  --input ./data/stemmer.txt \
+  --output ./build/english.radixor.gz \
+  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
+  --store-original
+```
+
+### 3. Use in application
+
+```java
+FrequencyTrie<String> trie =
+    StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
+```
+
+
+
+## Error handling
+
+The CLI reports:
+
+* missing input file
+* invalid arguments
+* I/O failures
+* parsing errors
+
+Typical exit codes:
+
+* `0` – success
+* non-zero – failure
+
+Error details are printed to standard error.
+
+
+
+## Performance considerations
+
+### Compilation
+
+* typically CPU-bound
+* depends on dictionary size and reduction mode
+
+### Output size
+
+* depends on:
+
+  * dictionary completeness
+  * reduction strategy
+* can vary significantly between modes
+
+### Runtime impact
+
+* compiled tries are optimized for:
+
+  * fast lookup
+  * low allocation
+  * predictable latency
+
+
+
+## Best practices
+
+### Use offline compilation
+
+* compile dictionaries during build or deployment
+* do not compile on application startup
+
+### Version your artifacts
+
+* treat `.radixor.gz` files as versioned assets
+* store them alongside application releases
+
+### Choose reduction mode deliberately
+
+* use **ranked** for correctness
+* use **dominant** only if you fully understand the trade-offs
+
+### Keep dictionaries clean
+
+* better input → better compiled output
+* avoid noise and inconsistencies
+
+
+
+## Integration tips
+
+* store compiled files under `resources/` or a dedicated directory
+* load them once and reuse the trie instance
+* avoid repeated loading in frequently executed code paths (for example, per-request processing)
+
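One way to follow the load-once advice is a small memoizing holder. The class `LoadOnce` is a hypothetical helper, not part of Radixor; it assumes the loader never returns `null`.

```java
import java.util.function.Supplier;

/** Illustrative thread-safe holder that runs its loader at most once. */
final class LoadOnce<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value;

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    // First caller performs the expensive load; later callers reuse it.
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

In an application, the loader passed to the constructor would typically wrap a call such as `StemmerPatchTrieLoader.loadBinary(...)` and convert its `IOException` into an `UncheckedIOException`; request handlers then call `get()` on a shared instance.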
+
+
+## Next steps
+
+* [Dictionary format](dictionary-format.md)
+* [Programmatic usage](programmatic-usage.md)
+* [Quick start](quick-start.md)
+
+
+
+## Summary
+
+The `Compile` CLI is the bridge between:
+
+* human-readable dictionary data
+* optimized runtime stemmer tables
+
+It enables a clean separation between:
+
+* data preparation
+* runtime execution
+
+and is the preferred way to prepare Radixor for production use.
--- a/docs/dictionary-format.md
+++ b/docs/dictionary-format.md
@@ -0,0 +1,255 @@
+# Dictionary Format
+
+> ← Back to [README.md](../README.md)
+
+Radixor uses a simple, line-oriented dictionary format to define mappings between **word forms** and their **canonical stems**.
+
+This format is intentionally minimal, language-agnostic, and easy to generate from existing linguistic resources or corpora.
+
+## Overview
+
+Each logical line defines:
+
+- one **canonical stem**
+- zero or more **word variants** belonging to that stem
+
+```
+stem variant1 variant2 variant3 ...
+```
+
+At compile time:
+
+- each variant is converted into a **patch command** transforming the variant into the stem
+- the stem itself may optionally be stored as a **no-op mapping**
+
+## Basic example
+
+```
+run running runs ran
+connect connected connecting connection
+analyze analyzing analysed analyses
+```
+
+This defines:
+
+| Stem     | Variants                              |
+|----------|----------------------------------------|
+| run      | running, runs, ran                     |
+| connect  | connected, connecting, connection      |
+| analyze  | analyzing, analysed, analyses          |
+
+## Syntax rules
+
+### 1. Tokenization
+
+- Tokens are separated by **whitespace**
+- Multiple spaces and tabs are treated as a single separator
+- Leading and trailing whitespace is ignored
+
+### 2. First token is the stem
+
+- The **first token** on each line is always the canonical stem
+- All following tokens are treated as variants of that stem
+
+### 3. Case normalization
+
+- All input is normalized to **lowercase using `Locale.ROOT`**
+- Dictionaries should ideally already be lowercase to avoid ambiguity
+
+### 4. Empty lines
+
+- Empty lines are ignored
+
+### 5. Duplicate variants
+
+- Duplicate variants are allowed but have no additional effect
+- Frequency is determined by occurrence across the entire dataset
+
+## Remarks (comments)
+
+The parser supports both full-line and trailing remarks.
+
+### Supported remark markers
+
+- `#`
+- `//`
+
+### Examples
+
+```
+run running runs ran   # English verb forms
+connect connected connecting  // basic forms
+```
+
+Everything after the first occurrence of a remark marker is ignored.
+
+### Important note
+
+There is no escape syntax for remark markers. If `#` or `//` appears inside a token, the rest of the line (including the remainder of that token) is ignored.
+
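The syntax and remark rules above can be condensed into a short tokenizer. This is a sketch that follows the rules as documented, not the source of `StemmerDictionaryParser`; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;

/** Illustrative tokenizer for one dictionary line. */
final class DictionaryLineSketch {

    private DictionaryLineSketch() {
    }

    /** Returns the lowercase tokens of a line (stem first), or an empty list. */
    static List<String> tokens(String line) {
        // Cut at the first remark marker, whichever of '#' or "//" comes first.
        int cut = line.length();
        int hash = line.indexOf('#');
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        int slashes = line.indexOf("//");
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        String content = line.substring(0, cut).strip().toLowerCase(Locale.ROOT);
        if (content.isEmpty()) {
            return List.of();
        }
        // Any run of spaces or tabs counts as a single separator.
        return List.of(content.split("\\s+"));
    }
}
```

For example, `tokens("Run running //comment runs")` yields `[run, running]`, which shows how a stray remark marker silently drops the variants that follow it.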
+## Storing the original form
+
+When compiling, you may enable:
+
+```
+--store-original
+```
+
+This causes the stem itself to be stored using a **no-op patch command**.
+
+Example:
+
+```
+run running runs
+```
+
+With `--store-original`, this implicitly includes:
+
+```
+run -> run
+```
+
+This is useful when:
+
+- the input may already be normalized
+- you want stable identity mappings
+- you want to avoid missing entries for canonical forms
+
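The effect of the flag on a single parsed line can be sketched as follows. The class `StoreOriginalSketch` is hypothetical, and the result maps word forms to stems rather than to patch commands, which keeps the example independent of the real encoder.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative expansion of one dictionary line into word-to-stem mappings. */
final class StoreOriginalSketch {

    private StoreOriginalSketch() {
    }

    /** Maps each word form on the line to its stem; the first token is the stem. */
    static Map<String, String> mappings(List<String> tokens, boolean storeOriginal) {
        Map<String, String> result = new LinkedHashMap<>();
        if (tokens.isEmpty()) {
            return result;
        }
        String stem = tokens.get(0);
        if (storeOriginal) {
            // Identity mapping: the stem resolves to itself.
            result.put(stem, stem);
        }
        for (String variant : tokens.subList(1, tokens.size())) {
            result.put(variant, stem);
        }
        return result;
    }
}
```

Without the flag, looking up `run` itself finds no entry from this line; with it, `run` maps to `run`.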
+## Frequency and ordering
+
+Radixor tracks **local frequencies** of values.
+
+Frequency is determined by:
+
+- how many times a mapping appears during construction
+- merging behavior during reduction
+
+When multiple stems exist for a word:
+
+- results are ordered by **descending frequency**
+- ties are resolved deterministically:
+  1. shorter textual representation wins
+  2. lexicographically smaller value wins
+  3. earlier insertion order wins
+
+This guarantees **stable and reproducible results**.
+
+## Ambiguity and multiple stems
+
+A word may legitimately map to more than one stem:
+
+```
+axes ax axe
+```
+
+This allows Radixor to represent ambiguity explicitly.
+
+At runtime:
+
+- `get(word)` returns the **preferred result**
+- `getAll(word)` returns **all candidates**
+
+## Design guidelines
+
+### Keep stems consistent
+
+Use a single canonical form:
+
+- `run` instead of mixing `run` / `running`
+- `analyze` vs `analyse` — pick one convention
+
+### Avoid noise
+
+Do not include:
+
+- typos
+- extremely rare forms (unless required)
+- inconsistent normalization
+
+### Prefer completeness over clever rules
+
+Radixor is data-driven:
+
+- more complete dictionaries → better results
+- no hidden rule system compensates for missing entries
+
+### Handle domain-specific vocabulary
+
+You can extend dictionaries with:
+
+- product names
+- technical terms
+- organization-specific terminology
+
+## Example: minimal dictionary
+
+```
+go goes going went
+be is are was were being
+have has having had
+```
+
+## Example: domain-specific extension
+
+```
+microservice microservices
+container containers containerized
+kubernetes kubernetes
+```
+
+## Common pitfalls
+
+### Mixing cases
+
+```
+Run running Runs   ❌
+```
+
+→ normalized to lowercase, but inconsistent input is error-prone
+
+### Multiple stems on one line
+
+```
+run running connect   ❌
+```
+
+→ `connect` becomes a variant of `run`, which is incorrect
+
+### Hidden comments
+
+```
+run running //comment runs   ❌
+```
+
+→ everything after `//` is ignored
+
+## When to use this format
+
+This format is suitable for:
+
+- curated linguistic datasets
+- exported morphological dictionaries
+- domain-specific vocabularies
+- generated `(word, stem)` pairs from corpora
+
+## Next steps
+
+- [CLI compilation](cli-compilation.md)
+- [Programmatic usage](programmatic-usage.md)
+- [Quick start](quick-start.md)
+
+## Summary
+
+Radixor dictionaries are intentionally simple:
+
+- one line per stem
+- whitespace-separated tokens
+- optional remarks
+- no embedded rules
+
+This simplicity enables:
+
+- easy generation
+- fast parsing
+- deterministic behavior
+- efficient compilation into compact patch-command tries
--- a/docs/programmatic-usage.md
+++ b/docs/programmatic-usage.md
@@ -0,0 +1,322 @@
+# Programmatic Usage
+
+> ← Back to [README.md](../README.md)
+
+This document describes how to use **Radixor** programmatically from Java.
+
+It covers:
+
+- building a trie from dictionary data
+- compiling it into an immutable structure
+- loading compiled stemmers
+- querying for stems
+- working with multiple candidates
+- modifying existing compiled stemmers
+
+
+
+## Overview
+
+Radixor separates the stemming lifecycle into three stages:
+
+1. **Build** – collect word–stem mappings in a mutable structure  
+2. **Compile** – reduce and convert to an immutable trie  
+3. **Query** – perform fast runtime lookups  
+
+These stages are represented by:
+
+- `FrequencyTrie.Builder` (mutable)
+- `FrequencyTrie` (immutable, compiled)
+- `StemmerPatchTrieLoader` / `StemmerPatchTrieBinaryIO` (I/O)
+
+
+
+## Building a trie programmatically
+
+You can construct a trie directly without using the CLI.
+
+```java
+import org.egothor.stemmer.*;
+
+public final class BuildExample {
+
+    public static void main(String[] args) {
+        ReductionSettings settings = ReductionSettings.withDefaults(
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+        );
+
+        FrequencyTrie.Builder<String> builder =
+                new FrequencyTrie.Builder<>(String[]::new, settings);
+
+        PatchCommandEncoder encoder = new PatchCommandEncoder();
+
+        builder.put("running", encoder.encode("running", "run"));
+        builder.put("runs", encoder.encode("runs", "run"));
+        builder.put("ran", encoder.encode("ran", "run"));
+
+        FrequencyTrie<String> trie = builder.build();
+    }
+}
+```
+
+
+
+## Loading from dictionary files
+
+To parse dictionary files directly:
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.*;
+
+public final class LoadFromDictionaryExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                Path.of("data/stemmer.txt"),
+                true,
+                ReductionSettings.withDefaults(
+                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+                )
+        );
+    }
+}
+```
+
+
+
+## Loading a compiled binary trie
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.*;
+
+public final class LoadBinaryExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> trie =
+                StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
+    }
+}
+```
+
+This is the **preferred production approach**.
+
+
+
+## Querying for stems
+
+### Preferred result
+
+```java
+String word = "running";
+String patch = trie.get(word);
+String stem = PatchCommandEncoder.apply(word, patch);
+```
+
+### All candidates
+
+```java
+String[] patches = trie.getAll(word);
+
+for (String patch : patches) {
+    String stem = PatchCommandEncoder.apply(word, patch);
+}
+```
+
+
+
+## Accessing value frequencies
+
+For diagnostic or advanced use cases:
+
+```java
+import java.util.List;
+
+import org.egothor.stemmer.ValueCount;
+
+List<ValueCount<String>> entries = trie.getEntries("axes");
+
+for (ValueCount<String> entry : entries) {
+    String patch = entry.value();
+    int count = entry.count();
+}
+```
+
+This allows:
+
+* inspecting ambiguity
+* understanding ranking decisions
+* debugging dictionary quality
+
+
+
+## Using bundled language resources
+
+```java
+FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+        StemmerPatchTrieLoader.Language.US_UK_PROFI,
+        true,
+        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+);
+```
+
+Bundled dictionaries are useful for:
+
+* quick integration
+* testing
+* reference behavior
+
+
+
+## Persisting a compiled trie
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.*;
+
+public final class SaveExample {
+
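+    public static void main(String[] args) throws IOException {
+        // Obtain a compiled trie, here by parsing a dictionary file, then persist it.
+        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                Path.of("data/stemmer.txt"),
+                true,
+                ReductionSettings.withDefaults(
+                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));
+
+        StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
+    }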
+}
+```
+
+
+
+## Modifying an existing trie
+
+A compiled trie can be reopened into a builder, extended, and rebuilt.
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.*;
+
+public final class ModifyExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> compiled =
+                StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));
+
+        ReductionSettings settings = ReductionSettings.withDefaults(
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+        );
+
+        FrequencyTrie.Builder<String> builder =
+                FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);
+
+        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
+
+        FrequencyTrie<String> updated = builder.build();
+
+        StemmerPatchTrieBinaryIO.write(updated,
+                Path.of("english-custom.radixor.gz"));
+    }
+}
+```
+
+
+
+## Thread safety
+
+* `FrequencyTrie` (compiled):
+
+  * **thread-safe**
+  * safe for concurrent reads
+
+* `FrequencyTrie.Builder`:
+
+  * **not thread-safe**
+  * intended for single-threaded construction
+
+
+
+## Performance characteristics
+
+### Querying
+
+* O(length of word)
+* minimal allocations
+* suitable for high-throughput pipelines
+
+### Loading
+
+* binary loading reads an already compiled trie
+* no dictionary parsing or reduction is required at load time
+
+### Building
+
+* depends on dictionary size
+* reduction phase may be CPU-intensive
+
+
+
+## Best practices
+
+### Reuse compiled trie instances
+
+* load once
+* share across threads
+
+### Prefer binary loading in production
+
+* avoid rebuilding at runtime
+* treat compiled files as deployable artifacts
+
+### Use `getAll()` only when needed
+
+* `get()` is faster and sufficient for most use cases
+
+### Keep builders short-lived
+
+* populate → call `build()` → discard the builder
+
+
+
+## Integration patterns
+
+### Search systems
+
+* apply stemming during indexing and querying
+* ensure consistent dictionary usage
+
+### Text normalization pipelines
+
+* integrate as a transformation step
+* combine with tokenization and filtering
+
+### Domain adaptation
+
+* extend dictionaries with domain-specific vocabulary
+* rebuild compiled artifacts
+
+
+
+## Next steps
+
+* [Dictionary format](dictionary-format.md)
+* [CLI compilation](cli-compilation.md)
+* [Architecture and reduction](architecture-and-reduction.md)
+
+
+
+## Summary
+
+Programmatic usage of Radixor follows a clear pattern:
+
+* build or load a trie
+* look up the patch command for a word
+* apply the patch to obtain the stem
+
+The API is intentionally simple at the surface, while providing deeper control when needed for:
+
+* ambiguity handling
+* diagnostics
+* dictionary evolution
--- a/docs/quality-and-operations.md
+++ b/docs/quality-and-operations.md
@@ -0,0 +1,317 @@
+# Quality and Operations
+
+> ← Back to [README.md](../README.md)
+
+This document describes quality, testing, and operational practices for **Radixor**.
+
+It focuses on:
+
+- reliability and determinism
+- testing strategies
+- deployment patterns
+- performance considerations
+- lifecycle management of stemmer data
+
+
+
+## Overview
+
+Radixor is designed to separate:
+
+- **data preparation** (dictionary construction and compilation)
+- **runtime execution** (lookup and patch application)
+
+This separation enables:
+
+- predictable runtime behavior
+- reproducible builds
+- controlled evolution of stemming data
+
+
+
+## Determinism and reproducibility
+
+Radixor emphasizes deterministic behavior.
+
+### Deterministic outputs
+
+Given:
+
+- the same dictionary input
+- the same reduction settings
+
+Radixor guarantees:
+
+- identical compiled trie structure
+- identical value ordering
+- identical lookup results
+
+### Why this matters
+
+- stable search behavior across deployments
+- reproducible builds
+- easier debugging and regression analysis
+
+
+
+## Testing strategy
+
+### Unit testing
+
+Core components should be tested independently:
+
+- patch encoding and decoding
+- trie construction
+- reduction behavior
+- binary serialization and deserialization
+
+### Dictionary validation tests
+
+A recommended pattern:
+
+1. load dictionary input
+2. compile trie
+3. re-apply all word → stem mappings
+4. verify that:
+   - the expected stem is present in the `getAll()` results
+   - the preferred result (`get()`) yields the expected stem when the word maps to a single stem
+
+This ensures:
+
+- no data loss during reduction
+- correctness of patch encoding
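+
+The pattern above can be sketched as a small helper. This is a minimal illustration, not part of Radixor: `DictionaryValidation` and `findMissing` are hypothetical names, and the lookup and patch application are passed in as functions so the helper stays independent of the library. Assuming the API shown in the programmatic usage guide, the two functions would be `trie::getAll` and `PatchCommandEncoder::apply`.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BinaryOperator;
+import java.util.function.Function;
+
+// Hypothetical helper: checks that every expected stem is reachable via getAll-style lookup.
+public final class DictionaryValidation {
+
+    private DictionaryValidation() {
+    }
+
+    /** Returns "word -> stem" for every expected stem that no candidate patch reproduces. */
+    public static List<String> findMissing(
+            Map<String, List<String>> expectedStems,
+            Function<String, String[]> getAll,
+            BinaryOperator<String> apply) {
+        List<String> missing = new ArrayList<>();
+        for (Map.Entry<String, List<String>> entry : expectedStems.entrySet()) {
+            String word = entry.getKey();
+            // Apply every candidate patch once and collect the resulting stems.
+            List<String> actual = new ArrayList<>();
+            for (String patch : getAll.apply(word)) {
+                actual.add(apply.apply(word, patch));
+            }
+            // Report each expected stem that is absent from the candidates.
+            for (String stem : entry.getValue()) {
+                if (!actual.contains(stem)) {
+                    missing.add(word + " -> " + stem);
+                }
+            }
+        }
+        return missing;
+    }
+}
+```
+
+An empty result means no mapping was lost during reduction; a test can assert on that and print the returned entries on failure.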
+
+
+
+## Regression testing
+
+Maintain a stable test dataset:
+
+- representative vocabulary
+- edge cases (short words, long words, ambiguous forms)
+
+Use it to:
+
+- detect unintended changes
+- verify behavior after refactoring
+- validate reduction mode changes
+
+
+
+## Performance testing
+
+Performance should be evaluated in terms of:
+
+### Throughput
+
+- words processed per second
+
+### Latency
+
+- time per lookup
+
+### Memory footprint
+
+- size of compiled trie
+- runtime memory usage
+
+Benchmark with:
+
+- realistic token streams
+- production-like dictionaries
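+
+For a first rough throughput figure, a plain timing loop may be enough. The sketch below is an assumption-laden illustration rather than a rigorous benchmark: `ThroughputProbe` is a hypothetical name, the stemmer is passed in as a function that must return a non-null string, and there is no warm-up control, so a dedicated benchmarking harness is preferable for numbers that will be published or compared.
+
+```java
+import java.util.List;
+import java.util.function.UnaryOperator;
+
+// Hypothetical probe: measures how many tokens per second a stemming function handles.
+public final class ThroughputProbe {
+
+    private ThroughputProbe() {
+    }
+
+    public static double wordsPerSecond(
+            List<String> tokens, UnaryOperator<String> stemmer, int rounds) {
+        long sink = 0; // consume results so the calls cannot be optimized away
+        long start = System.nanoTime();
+        for (int round = 0; round < rounds; round++) {
+            for (String token : tokens) {
+                sink += stemmer.apply(token).length();
+            }
+        }
+        // Guard against a zero interval on very small workloads.
+        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
+        if (sink < 0) {
+            throw new IllegalStateException("length sum overflowed");
+        }
+        return (double) tokens.size() * rounds * 1_000_000_000.0 / elapsedNanos;
+    }
+}
+```
+
+Run it several times and discard the first results so that JIT compilation does not dominate the measurement.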
+
+
+
+## Deployment model
+
+### Recommended workflow
+
+1. prepare dictionary data
+2. compile using the `Compile` CLI
+3. store the `.radixor.gz` artifact
+4. deploy the artifact with the application
+5. load it using `StemmerPatchTrieLoader.loadBinary(...)`
+
+### Why this model
+
+- avoids runtime compilation overhead
+- reduces startup latency
+- ensures consistent behavior across environments
+
+
+
+## Artifact management
+
+Compiled stemmers should be treated as versioned assets.
+
+### Versioning
+
+- include version in filename or metadata
+- track dictionary source and reduction settings
+
+Example:
+
+```
+english-v1.2-ranked.radixor.gz
+```
+
+### Storage
+
+- store in repository or artifact storage
+- ensure consistent distribution across environments
+
+
+
+## Runtime usage
+
+### Loading
+
+- load once during application startup
+- reuse `FrequencyTrie` instance
+
+### Thread safety
+
+- compiled trie is safe for concurrent access
+- no synchronization required for reads
+
+### Avoid repeated loading
+
+Do not:
+
+- load trie per request
+- rebuild trie at runtime
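+
+One way to enforce "load once, reuse everywhere" is a small thread-safe holder. The sketch below is illustrative only: `LoadOnce` is a hypothetical class, not a Radixor type, and it assumes the loader never returns `null`. Because `loadBinary(...)` declares `IOException`, the supplier passed in would need to wrap that exception.
+
+```java
+import java.util.function.Supplier;
+
+// Hypothetical holder: runs the loader at most once and shares the result across threads.
+public final class LoadOnce<T> implements Supplier<T> {
+
+    private final Supplier<T> loader;
+    private volatile T value;
+
+    public LoadOnce(Supplier<T> loader) {
+        this.loader = loader;
+    }
+
+    @Override
+    public T get() {
+        T result = value;
+        if (result == null) {
+            synchronized (this) {
+                // Re-check under the lock so only one thread performs the load.
+                result = value;
+                if (result == null) {
+                    result = loader.get();
+                    value = result;
+                }
+            }
+        }
+        return result;
+    }
+}
+```
+
+Request handlers then call `get()` on a shared holder instead of loading the artifact themselves.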
+
+
+
+## Memory considerations
+
+- compiled tries are compact, but their memory footprint is not negligible
+- size depends on:
+  - dictionary size
+  - reduction mode
+
+Recommendations:
+
+- monitor memory usage in production
+- choose reduction mode appropriately
+
+
+
+## Reduction mode in production
+
+Default recommendation:
+
+- use **ranked mode** (`MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`)
+
+Switch to other modes only when:
+
+- memory constraints are strict
+- multiple candidate results are not required
+
+Always validate behavior after changing reduction mode.
+
+
+
+## Dictionary lifecycle
+
+### Updating dictionaries
+
+When dictionary data changes:
+
+1. update source file
+2. recompile
+3. run validation tests
+4. deploy new artifact
+
+### Backward compatibility
+
+- changes to the dictionary may change stemming results
+- evaluate the impact on search relevance before deploying a new artifact
+
+
+
+## Observability
+
+Radixor itself does not provide observability features; the integrating application should provide:
+
+- logging for loading failures
+- metrics for lookup throughput
+- monitoring of memory usage
+
+Optional:
+
+- sampling of ambiguous results (`getAll()`)
+
+
+
+## Error handling
+
+### During compilation
+
+Handle:
+
+- invalid dictionary format
+- I/O failures
+- invalid arguments
+
+### During runtime
+
+Handle:
+
+- missing dictionary files
+- corrupted binary artifacts
+
+Fail fast on initialization errors.
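+
+A fail-fast startup check might look like the following sketch. `FailFastStartup` and `BinaryLoader` are hypothetical names introduced for illustration; the loader argument stands in for a call such as `StemmerPatchTrieLoader::loadBinary`. It is an assumption here that a corrupted artifact surfaces as an `IOException`; any other exception propagates unchanged.
+
+```java
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+
+// Hypothetical startup helper: converts load problems into an immediate, descriptive failure.
+public final class FailFastStartup {
+
+    /** Stand-in for a loading method that declares IOException. */
+    @FunctionalInterface
+    public interface BinaryLoader<T> {
+        T load(Path path) throws IOException;
+    }
+
+    private FailFastStartup() {
+    }
+
+    public static <T> T loadOrAbort(Path artifact, BinaryLoader<T> loader) {
+        // A missing file is reported before any loading is attempted.
+        if (!Files.isRegularFile(artifact)) {
+            throw new IllegalStateException("Stemmer artifact not found: " + artifact);
+        }
+        try {
+            return loader.load(artifact);
+        } catch (IOException e) {
+            throw new IllegalStateException("Cannot load stemmer artifact: " + artifact, e);
+        }
+    }
+}
+```
+
+Calling this during application initialization makes a broken deployment fail at startup rather than on the first lookup.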
+
+
+
+## Operational best practices
+
+- compile dictionaries offline
+- version compiled artifacts
+- test before deployment
+- load once and reuse
+- monitor performance and memory
+- document reduction settings used
+
+
+
+## Security considerations
+
+- dictionary input is assumed to be trusted; compile only data from sources you control
+- validate external sources before compilation
+- avoid loading unverified binary artifacts
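+
+One common way to verify a binary artifact before loading it is to compare its SHA-256 digest with a value recorded at build time. The sketch below uses only the JDK; `ArtifactChecksum` is a hypothetical helper, and how the expected digest is stored and distributed is left to the deployment process.
+
+```java
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.security.MessageDigest;
+import java.security.NoSuchAlgorithmException;
+
+// Hypothetical helper: rejects an artifact whose SHA-256 digest differs from the expected one.
+public final class ArtifactChecksum {
+
+    private ArtifactChecksum() {
+    }
+
+    /** Returns the lowercase hexadecimal SHA-256 digest of the given bytes. */
+    public static String sha256Hex(byte[] content) {
+        try {
+            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
+            StringBuilder hex = new StringBuilder(digest.length * 2);
+            for (byte b : digest) {
+                hex.append(String.format("%02x", b & 0xff));
+            }
+            return hex.toString();
+        } catch (NoSuchAlgorithmException e) {
+            throw new IllegalStateException("SHA-256 is unavailable", e);
+        }
+    }
+
+    /** Throws IOException when the file content does not match the expected digest. */
+    public static void requireMatch(Path artifact, String expectedHex) throws IOException {
+        String actual = sha256Hex(Files.readAllBytes(artifact));
+        if (!actual.equalsIgnoreCase(expectedHex)) {
+            throw new IOException("Checksum mismatch for " + artifact + ": " + actual);
+        }
+    }
+}
+```
+
+Calling `requireMatch(...)` immediately before the load keeps an altered or truncated file from reaching the deserializer.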
+
+
+
+## Integration checklist
+
+Before production deployment:
+
+- dictionary validated
+- compiled artifact generated
+- reduction mode documented
+- performance tested
+- memory usage verified
+- regression tests passing
+
+
+
+## Next steps
+
+- [Quick start](quick-start.md)
+- [CLI compilation](cli-compilation.md)
+- [Programmatic usage](programmatic-usage.md)
+
+
+
+## Summary
+
+Radixor is designed for:
+
+- deterministic behavior
+- efficient runtime execution
+- controlled data-driven evolution
+
+By separating compilation from runtime and following proper operational practices, it can be reliably integrated into production-grade systems.
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -0,0 +1,148 @@
+# Quick Start
+
+> ← Back to [README.md](../README.md)
+
+This guide shows the fastest way to start using **Radixor** and the most common next steps.
+
+## Hello world
+
+```java
+import java.io.IOException;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.PatchCommandEncoder;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class HelloRadixor {
+
+    private HelloRadixor() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                true,
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+
+        final String word = "running";
+        final String patch = trie.get(word);
+        final String stem = PatchCommandEncoder.apply(word, patch);
+
+        System.out.println(word + " -> " + stem);
+    }
+}
+```
+
+This example shows the core workflow:
+
+1. load a trie
+2. get a patch command for a word
+3. apply the patch
+4. obtain the stem
+
+## Retrieve multiple candidate stems
+
+If you need more than one candidate result, use `getAll(...)` instead of `get(...)`.
+
+```java
+final String word = "axes";
+final String[] patches = trie.getAll(word);
+
+for (String patch : patches) {
+    final String stem = PatchCommandEncoder.apply(word, patch);
+    System.out.println(word + " -> " + stem + " (" + patch + ")");
+}
+```
+
+## Load a compiled binary stemmer
+
+For production systems, the preferred approach is usually to precompile the dictionary and load the compressed binary artifact at runtime.
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.PatchCommandEncoder;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class BinaryStemmerExample {
+
+    private BinaryStemmerExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final Path path = Path.of("stemmers", "english.radixor.gz");
+        final FrequencyTrie<String> trie = StemmerPatchTrieLoader.loadBinary(path);
+
+        final String word = "connected";
+        final String patch = trie.get(word);
+        final String stem = PatchCommandEncoder.apply(word, patch);
+
+        System.out.println(word + " -> " + stem);
+    }
+}
+```
+
+## Compile a dictionary from the command line
+
+```bash
+java org.egothor.stemmer.Compile \
+    --input ./data/stemmer.txt \
+    --output ./build/english.radixor.gz \
+    --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
+    --store-original \
+    --overwrite
+```
+
+## Modify an existing compiled stemmer
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.FrequencyTrieBuilders;
+import org.egothor.stemmer.PatchCommandEncoder;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.ReductionSettings;
+import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
+
+public final class ModifyCompiledExample {
+
+    private ModifyCompiledExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final Path input = Path.of("stemmers", "english.radixor.gz");
+        final Path output = Path.of("stemmers", "english-custom.radixor.gz");
+
+        final FrequencyTrie<String> compiledTrie = StemmerPatchTrieBinaryIO.read(input);
+
+        final ReductionSettings settings = ReductionSettings.withDefaults(
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+
+        final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
+                compiledTrie,
+                String[]::new,
+                settings);
+
+        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
+
+        final FrequencyTrie<String> updatedTrie = builder.build();
+        StemmerPatchTrieBinaryIO.write(updatedTrie, output);
+    }
+}
+```
+
+## Where to continue
+
+* [Dictionary format](dictionary-format.md)
+* [CLI compilation](cli-compilation.md)
+* [Programmatic usage](programmatic-usage.md)
+* [Built-in languages](built-in-languages.md)
+* [Architecture and reduction](architecture-and-reduction.md)