Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words
fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs
refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure
test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics
docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance
chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00
parent 15248c92c9
commit 038514bad0
64 changed files with 190190 additions and 20 deletions
--- a/docs/cli-compilation.md
+++ b/docs/cli-compilation.md
@@ -0,0 +1,305 @@
+# CLI Compilation
+
+> ← Back to [README.md](../README.md)
+
+Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.
+
+This is the recommended workflow for deployment environments, as it separates:
+
+- dictionary preparation (offline)
+- stemming execution (runtime)
+
+
+
+## Overview
+
+The `Compile` tool:
+
+1. reads a line-oriented dictionary file
+2. converts word–stem pairs into patch commands
+3. builds a trie structure
+4. applies subtree reduction
+5. writes a compressed binary artifact
+
+The output is a `.radixor.gz` file suitable for fast runtime loading.
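+
+Step 2 of the list above can be pictured with a deliberately simplified patch model. The sketch below is an illustration only and is not Radixor's actual patch-command encoding: `PatchSketch`, `SuffixPatch`, `diff`, and `apply` are hypothetical names. It expresses a word-to-stem transformation as "delete N trailing characters, then append a suffix", which hints at why many different words can end up sharing the same patch.
+
+```java
+public class PatchSketch {
+
+    /** Hypothetical, simplified patch: drop trailing characters, then append a suffix. */
+    static final class SuffixPatch {
+        final int deleteCount;
+        final String append;
+
+        SuffixPatch(int deleteCount, String append) {
+            this.deleteCount = deleteCount;
+            this.append = append;
+        }
+
+        @Override
+        public String toString() {
+            return "-" + deleteCount + "+" + append;
+        }
+    }
+
+    /** Derives the patch that turns {@code word} into {@code stem}. */
+    static SuffixPatch diff(String word, String stem) {
+        int common = 0;
+        int limit = Math.min(word.length(), stem.length());
+        // Measure the shared prefix; everything after it must be rewritten.
+        while (common < limit && word.charAt(common) == stem.charAt(common)) {
+            common++;
+        }
+        return new SuffixPatch(word.length() - common, stem.substring(common));
+    }
+
+    /** Applies a patch to a word and returns the resulting stem. */
+    static String apply(String word, SuffixPatch patch) {
+        return word.substring(0, word.length() - patch.deleteCount) + patch.append;
+    }
+
+    public static void main(String[] args) {
+        String stem = "connect";
+        for (String word : new String[] {"connected", "connecting", "connect"}) {
+            SuffixPatch patch = diff(word, stem);
+            System.out.println(word + " -> " + patch + " -> " + apply(word, patch));
+        }
+    }
+}
+```
+
+In this simplified model, pairing `connect` with itself yields a patch that deletes nothing and appends nothing, which is the kind of no-op mapping that `--store-original` records.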
+
+
+
+## Basic usage
+
+```bash
+java org.egothor.stemmer.Compile \
+  --input ./data/stemmer.txt \
+  --output ./build/english.radixor.gz \
+  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
+  --store-original \
+  --overwrite
+```
+
+
+
+## Required arguments
+
+### `--input`
+
+Path to the source dictionary file.
+
+* must be in the [dictionary format](dictionary-format.md)
+* must be readable
+* UTF-8 encoding is expected
+
+```
+--input ./data/stemmer.txt
+```
+
+### `--output`
+
+Path to the output binary file.
+
+* parent directories are created automatically
+* output is written as **GZip-compressed binary**
+
+```
+--output ./build/english.radixor.gz
+```
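+
+Both behaviors listed for `--output` map onto standard JDK calls. The following is a minimal sketch under that assumption, not the `Compile` implementation: `GzipOutputSketch`, `writeCompressed`, and `readCompressed` are hypothetical names, and the payload is placeholder bytes rather than the real binary trie format.
+
+```java
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.Arrays;
+import java.util.zip.GZIPInputStream;
+import java.util.zip.GZIPOutputStream;
+
+public class GzipOutputSketch {
+
+    /** Creates missing parent directories, then writes the payload GZip-compressed. */
+    static void writeCompressed(Path output, byte[] payload) throws IOException {
+        Path parent = output.toAbsolutePath().getParent();
+        if (parent != null) {
+            Files.createDirectories(parent); // no-op if the directories already exist
+        }
+        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(output))) {
+            out.write(payload);
+        }
+    }
+
+    /** Reads a GZip-compressed file back into memory. */
+    static byte[] readCompressed(Path input) throws IOException {
+        try (InputStream in = new GZIPInputStream(Files.newInputStream(input))) {
+            return in.readAllBytes();
+        }
+    }
+
+    public static void main(String[] args) throws IOException {
+        Path dir = Files.createTempDirectory("radixor-sketch");
+        // The "build" directory does not exist yet and is created on demand.
+        Path output = dir.resolve("build").resolve("example.radixor.gz");
+        byte[] payload = "placeholder".getBytes(StandardCharsets.UTF_8);
+        writeCompressed(output, payload);
+        System.out.println("round trip ok: " + Arrays.equals(payload, readCompressed(output)));
+    }
+}
+```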
+
+
+
+## Optional arguments
+
+### `--reduction-mode`
+
+Controls how aggressively the trie is reduced during compilation.
+
+Available values:
+
+* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
+* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
+* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
+
+Example:
+
+```
+--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+```
+
+#### Recommendation
+
+Use:
+
+```
+MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+```
+
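+This provides:
+
+* the safest behavior of the three modes, because the full `getAll()` ordering is preserved
+* deterministic ordering
+* moderate compression (the least aggressive of the three modes)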
+
+
+
+### `--store-original`
+
+Stores each stem as a no-op mapping to itself, so that looking up a stem returns the stem unchanged.
+
+```
+--store-original
+```
+
+Effect:
+
+* ensures that canonical forms are always resolvable
+* improves robustness on real-world inputs, which often contain words that are already in canonical form
+
+Recommended for most use cases.
+
+
+
+### `--overwrite`
+
+Allows overwriting an existing output file.
+
+```
+--overwrite
+```
+
+Without this flag:
+
+* compilation fails if the output file already exists
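+
+The fail-if-exists behavior can be expressed with the JDK's `StandardOpenOption.CREATE_NEW`. The sketch below assumes that approach purely for illustration; how `Compile` actually performs the check is not stated here, and `OverwriteSketch` and `open` are hypothetical names.
+
+```java
+import java.io.IOException;
+import java.io.OutputStream;
+import java.nio.file.FileAlreadyExistsException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.StandardOpenOption;
+
+public class OverwriteSketch {
+
+    /** Opens the output for writing; refuses to replace an existing file unless allowed. */
+    static OutputStream open(Path output, boolean overwrite) throws IOException {
+        if (overwrite) {
+            return Files.newOutputStream(output,
+                    StandardOpenOption.CREATE,
+                    StandardOpenOption.TRUNCATE_EXISTING,
+                    StandardOpenOption.WRITE);
+        }
+        // CREATE_NEW throws FileAlreadyExistsException if the file is already present.
+        return Files.newOutputStream(output,
+                StandardOpenOption.CREATE_NEW,
+                StandardOpenOption.WRITE);
+    }
+
+    public static void main(String[] args) throws IOException {
+        Path output = Files.createTempFile("radixor-sketch", ".gz"); // file now exists
+        try (OutputStream out = open(output, false)) {
+            System.out.println("unexpected: opened an existing file");
+        } catch (FileAlreadyExistsException e) {
+            System.out.println("refused to overwrite " + output.getFileName());
+        }
+        try (OutputStream out = open(output, true)) {
+            out.write(42); // replaces the previous (empty) content
+        }
+        System.out.println("size after overwrite: " + Files.size(output));
+    }
+}
+```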
+
+
+
+## Reduction strategy explained
+
+Reduction merges semantically equivalent subtrees to reduce memory and file size.
+
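+Trade-offs (the short names refer to the `RANKED`, `UNORDERED`, and `DOMINANT` variants of the `--reduction-mode` values):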
+
+| Mode      | Compression | Behavioral fidelity |
+| --------- | ----------- | ------------------- |
+| Ranked    | Medium      | High                |
+| Unordered | High        | Medium              |
+| Dominant  | Highest     | Lower (heuristic)   |
+
+### Ranked (recommended)
+
+* preserves full `getAll()` ordering
+* safest and most predictable
+
+### Unordered
+
+* ignores ordering differences
+* higher compression, but less precise semantics
+
+### Dominant
+
+* focuses on the most frequent result
+* useful when only `get()` is relevant
+* may lose secondary candidates
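+
+One way to picture the difference between the three modes is as three different equivalence tests over a node's ranked result list. The sketch below is conceptual and is not Radixor's reduction code: `ReductionSignatureSketch`, `Mode`, `signature`, and `mergeable` are hypothetical names, and it compares only a single node's values, whereas real subtree reduction must also take child nodes into account.
+
+```java
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+public class ReductionSignatureSketch {
+
+    enum Mode { RANKED, UNORDERED, DOMINANT }
+
+    /** Builds a comparison key from a ranked value list (most frequent first). */
+    static String signature(List<String> rankedValues, Mode mode) {
+        switch (mode) {
+            case RANKED:
+                // Order matters: the key keeps the ranking as-is.
+                return String.join("|", rankedValues);
+            case UNORDERED: {
+                // Order is ignored: sort a copy so permutations collapse to one key.
+                List<String> sorted = new ArrayList<>(rankedValues);
+                Collections.sort(sorted);
+                return String.join("|", sorted);
+            }
+            case DOMINANT:
+                // Only the top-ranked value is considered.
+                return rankedValues.isEmpty() ? "" : rankedValues.get(0);
+            default:
+                throw new IllegalArgumentException("Unknown mode: " + mode);
+        }
+    }
+
+    /** Two nodes are merge candidates when their keys match under the chosen mode. */
+    static boolean mergeable(List<String> a, List<String> b, Mode mode) {
+        return signature(a, mode).equals(signature(b, mode));
+    }
+
+    public static void main(String[] args) {
+        List<String> a = Arrays.asList("p1", "p2");
+        List<String> b = Arrays.asList("p2", "p1"); // same values, different ranking
+        List<String> c = Arrays.asList("p1", "p3"); // same top value, different alternative
+        for (Mode mode : Mode.values()) {
+            System.out.println(mode + ": a~b=" + mergeable(a, b, mode)
+                    + ", a~c=" + mergeable(a, c, mode));
+        }
+    }
+}
+```
+
+In this toy model the ranked test merges neither pair, the unordered test merges `a` with `b`, and the dominant test merges `a` with `c`, discarding the difference between `p2` and `p3`. That mirrors the trade-off described above: looser equivalence allows more merging but preserves less of the original result list.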
+
+
+
+## Output format
+
+The compiled file:
+
+* is a binary representation of the trie
+* uses **GZip compression**
+* is optimized for fast loading and a minimal memory footprint
+
+Typical properties:
+
+* small file size
+* no runtime preprocessing required after deserialization
+
+
+
+## Example workflow
+
+### 1. Prepare dictionary
+
+```
+run running runs ran
+connect connected connecting
+```
+
+### 2. Compile
+
+```bash
+java org.egothor.stemmer.Compile \
+  --input ./data/stemmer.txt \
+  --output ./build/english.radixor.gz \
+  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
+  --store-original
+```
+
+### 3. Use in application
+
+```java
+// Load the compiled artifact produced in step 2 (adjust the path to where the file is deployed).
+FrequencyTrie<String> trie =
+    StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
+```
+
+
+
+## Error handling
+
+The CLI reports:
+
+* missing input file
+* invalid arguments
+* I/O failures
+* parsing errors
+
+Typical exit codes:
+
+* `0` – success
+* non-zero – failure
+
+Error details are printed to standard error.
+
+
+
+## Performance considerations
+
+### Compilation
+
+* typically CPU-bound
+* depends on dictionary size and reduction mode
+
+### Output size
+
+* depends on:
+
+  * dictionary completeness
+  * reduction strategy
+* can vary significantly between modes
+
+### Runtime impact
+
+* compiled tries are optimized for:
+
+  * fast lookup
+  * low allocation
+  * predictable latency
+
+
+
+## Best practices
+
+### Use offline compilation
+
+* compile dictionaries during build or deployment
+* do not compile on application startup
+
+### Version your artifacts
+
+* treat `.radixor.gz` files as versioned assets
+* store them alongside application releases
+
+### Choose reduction mode deliberately
+
+* use **ranked** for correctness
+* use **dominant** only if you fully understand the trade-offs
+
+### Keep dictionaries clean
+
+* the quality of the compiled table is bounded by the quality of the source dictionary
+* remove noisy or inconsistent entries before compiling
+
+
+
+## Integration tips
+
+* store compiled files under `resources/` or a dedicated directory
+* load them once and reuse the trie instance
+* avoid repeated loading in frequently executed code paths (for example, per-request processing)
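+
+The "load once and reuse" advice can be sketched as a small lazy holder. This is a generic illustration with hypothetical names (`LoadOnceSketch`, `Lazy`); the loader here returns a placeholder string, and in an application it would wrap whatever call loads the compiled trie. The sketch assumes the loader never returns `null`.
+
+```java
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.function.Supplier;
+
+public class LoadOnceSketch {
+
+    /** Thread-safe holder that invokes the loader at most once. */
+    static final class Lazy<T> implements Supplier<T> {
+        private final Supplier<T> loader;
+        private volatile T value;
+
+        Lazy(Supplier<T> loader) {
+            this.loader = loader;
+        }
+
+        @Override
+        public T get() {
+            T result = value;
+            if (result == null) {
+                synchronized (this) {
+                    result = value;
+                    if (result == null) {
+                        result = loader.get(); // expensive load happens here, once
+                        value = result;
+                    }
+                }
+            }
+            return result;
+        }
+    }
+
+    public static void main(String[] args) {
+        AtomicInteger loads = new AtomicInteger();
+        Lazy<String> trie = new Lazy<>(() -> {
+            loads.incrementAndGet();
+            return "compiled-trie-placeholder";
+        });
+        for (int i = 0; i < 3; i++) {
+            trie.get(); // e.g. called from per-request code
+        }
+        System.out.println("loader invocations: " + loads.get());
+    }
+}
+```
+
+The loader runs once even though `get()` is called three times, so per-request code can call `get()` freely without re-reading the file.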
+
+
+
+## Next steps
+
+* [Dictionary format](dictionary-format.md)
+* [Programmatic usage](programmatic-usage.md)
+* [Quick start](quick-start.md)
+
+
+
+## Summary
+
+The `Compile` CLI is the bridge between human-readable dictionary data and optimized runtime stemmer tables.
+
+It keeps data preparation separate from runtime execution and is the preferred way to prepare Radixor for production use.