Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words

fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs

refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure

test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics

docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance

chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00
parent 15248c92c9
commit 038514bad0
64 changed files with 190190 additions and 20 deletions

docs/cli-compilation.md Normal file

@@ -0,0 +1,305 @@
# CLI Compilation
> ← Back to [README.md](../README.md)
Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.
This is the recommended workflow for deployment environments, as it separates:
- dictionary preparation (offline)
- stemming execution (runtime)
## Overview
The `Compile` tool:
1. reads a line-oriented dictionary file
2. converts word-stem pairs into patch commands
3. builds a trie structure
4. applies subtree reduction
5. writes a compressed binary artifact
The output is a `.radixor.gz` file suitable for fast runtime loading.
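
As a rough illustration of steps 1 and 2, the sketch below splits one dictionary line into word-stem pairs and derives a simplified patch for each pair. It assumes that the first token on a line is the stem and the remaining tokens are its variants. The class `DictionaryLineSketch`, both method names, and the `-<n>+<suffix>` notation are hypothetical and are not Radixor's actual parser or patch encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DictionaryLineSketch {

    /**
     * Splits one dictionary line into (word, stem) pairs.
     * Assumption: the first token is the stem, the remaining tokens are variants.
     */
    static List<Map.Entry<String, String>> parseLine(String line) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs; // blank lines contribute no pairs
        }
        String[] tokens = trimmed.split("\\s+");
        String stem = tokens[0];
        for (int i = 1; i < tokens.length; i++) {
            pairs.add(Map.entry(tokens[i], stem)); // key = word, value = stem
        }
        return pairs;
    }

    /**
     * Hypothetical, simplified patch: "-<n>+<suffix>" means "drop n trailing
     * characters from the word, then append suffix". Not the real encoding.
     */
    static String suffixPatch(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        int drop = word.length() - common;
        return "-" + drop + "+" + stem.substring(common);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> pair : parseLine("run running runs ran")) {
            System.out.println(pair.getKey() + " -> " + pair.getValue()
                    + " via " + suffixPatch(pair.getKey(), pair.getValue()));
        }
    }
}
```

In this simplified notation, `running` maps to `run` via `-4+` and `ran` via `-2+un`. Words that share a patch can share trie values, which is what makes the later reduction step effective.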
## Basic usage
```bash
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original \
--overwrite
```
## Required arguments
### `--input`
Path to the source dictionary file.
* must be in the [dictionary format](dictionary-format.md)
* must be readable
* UTF-8 encoding is expected
```
--input ./data/stemmer.txt
```
### `--output`
Path to the output binary file.
* parent directories are created automatically
* output is written as **GZip-compressed binary**
```
--output ./build/english.radixor.gz
```
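
The two output guarantees above (parent directories created, GZip-compressed payload) can be approximated with standard JDK calls. This is a minimal sketch, not the `Compile` implementation: the class `GzipOutputSketch` and its methods are hypothetical, and the payload is an arbitrary byte array rather than a real trie.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipOutputSketch {

    /** Writes the payload as GZip data, creating missing parent directories first. */
    static void writeCompressed(Path output, byte[] payload) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no-op if the directories already exist
        }
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(output))) {
            out.write(payload);
        }
    }

    /** Reads the file back and returns the decompressed bytes. */
    static byte[] readCompressed(Path input) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(input))) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("radixor-sketch");
        Path out = dir.resolve("build").resolve("demo.radixor.gz");
        writeCompressed(out, "placeholder payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readCompressed(out), StandardCharsets.UTF_8));
    }
}
```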
## Optional arguments
### `--reduction-mode`
Controls how aggressively the trie is reduced during compilation.
Available values:
* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
Example:
```
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
#### Recommendation
Use:
```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
This provides:
* safe behavior: the full `getAll()` ordering is preserved
* deterministic ordering
* good compression, though less than the other two modes
### `--store-original`
Stores the stem itself as a no-op mapping.
```
--store-original
```
Effect:
* ensures that canonical forms are always resolvable
* improves robustness in real-world inputs
Recommended for most use cases.
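
To make the effect concrete, the sketch below builds word-to-stem mappings for one dictionary entry with and without the extra self-mapping. It is only an illustration: `StoreOriginalSketch` and `buildMappings` are hypothetical names, and a plain map stands in for the trie and its patch commands.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StoreOriginalSketch {

    /**
     * Maps every variant to its stem. When storeOriginal is true, the stem is
     * also mapped to itself, which corresponds to a no-op transformation.
     */
    static Map<String, String> buildMappings(String stem, List<String> variants,
                                             boolean storeOriginal) {
        Map<String, String> mappings = new LinkedHashMap<>();
        if (storeOriginal) {
            mappings.put(stem, stem); // canonical form resolves to itself
        }
        for (String variant : variants) {
            mappings.put(variant, stem);
        }
        return mappings;
    }

    public static void main(String[] args) {
        List<String> variants = List.of("running", "runs", "ran");
        System.out.println(buildMappings("run", variants, false).containsKey("run"));
        System.out.println(buildMappings("run", variants, true).get("run"));
    }
}
```

Without the flag, a lookup for the canonical form `run` finds no entry in this model; with it, `run` resolves to `run`.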
### `--overwrite`
Allows overwriting an existing output file.
```
--overwrite
```
Without this flag:
* compilation fails if the output file already exists
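
One way to get this fail-if-exists behavior in Java is to choose the file open options based on the flag. The sketch below shows that technique with standard NIO calls; `OverwriteSketch` is a hypothetical class, and whether `Compile` uses the same mechanism internally is an assumption.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class OverwriteSketch {

    /**
     * Opens the output file. Without overwrite, CREATE_NEW makes the call fail
     * with FileAlreadyExistsException if the file is already present.
     */
    static OutputStream open(Path output, boolean overwrite) throws IOException {
        OpenOption[] options = overwrite
                ? new OpenOption[] {
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.WRITE }
                : new OpenOption[] {
                        StandardOpenOption.CREATE_NEW,
                        StandardOpenOption.WRITE };
        return Files.newOutputStream(output, options);
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempFile("radixor-sketch", ".gz");
        try (OutputStream out = open(existing, false)) {
            System.out.println("unexpected: file was opened");
        } catch (FileAlreadyExistsException e) {
            System.out.println("refused to overwrite " + e.getFile());
        }
        try (OutputStream out = open(existing, true)) {
            System.out.println("overwrite allowed");
        }
        Files.delete(existing);
    }
}
```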
## Reduction strategy explained
Reduction merges semantically equivalent subtrees to reduce memory and file size.
Trade-offs:
| Mode | Compression | Behavioral fidelity |
| --------- | ----------- | ------------------- |
| Ranked | Medium | High |
| Unordered | High | Medium |
| Dominant | Highest | Lower (heuristic) |
### Ranked (recommended)
* preserves full `getAll()` ordering
* safest and most predictable
### Unordered
* ignores ordering differences
* higher compression, but less precise semantics
### Dominant
* focuses on the most frequent result
* useful when only `get()` is relevant
* may lose secondary candidates
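
The difference between the three modes comes down to what counts as "equivalent". The sketch below models only that comparison, for a single node, using lists of values ordered from most to least frequent as a stand-in for `getAll()` results. The class and method names are hypothetical, and the real reduction compares whole subtrees rather than one list.

```java
import java.util.HashSet;
import java.util.List;

final class ReductionEquivalenceSketch {

    /** Ranked: same values in the same order. */
    static boolean rankedEquivalent(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /** Unordered: same set of values, order ignored. */
    static boolean unorderedEquivalent(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    /** Dominant: only the top-ranked value has to match. */
    static boolean dominantEquivalent(List<String> a, List<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty();
        }
        return a.get(0).equals(b.get(0));
    }

    public static void main(String[] args) {
        // Placeholder patch labels, ordered by descending frequency.
        List<String> first = List.of("patchA", "patchB");
        List<String> swapped = List.of("patchB", "patchA");
        List<String> otherTail = List.of("patchA", "patchC");

        System.out.println(rankedEquivalent(first, swapped));      // false
        System.out.println(unorderedEquivalent(first, swapped));   // true
        System.out.println(unorderedEquivalent(first, otherTail)); // false
        System.out.println(dominantEquivalent(first, otherTail));  // true
    }
}
```

The looser the comparison, the more nodes qualify for merging, which is why compression rises from ranked to dominant while fidelity falls.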
## Output format
The compiled file:
* is a binary representation of the trie
* uses **GZip compression**
* is optimized for:
  * fast loading
  * minimal memory footprint
Typical properties:
* small file size
* fast deserialization
* no runtime preprocessing required
## Example workflow
### 1. Prepare dictionary
```
run running runs ran
connect connected connecting
```
### 2. Compile
```bash
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original
```
### 3. Use in application
```java
// Load the compiled artifact produced in step 2.
FrequencyTrie<String> trie =
        StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```
## Error handling
The CLI reports:
* missing input file
* invalid arguments
* I/O failures
* parsing errors
Typical exit codes:
* `0` success
* non-zero failure
Error details are printed to standard error.
## Performance considerations
### Compilation
* typically CPU-bound
* depends on dictionary size and reduction mode
### Output size
* depends on:
  * dictionary completeness
  * reduction strategy
* can vary significantly between modes
### Runtime impact
* compiled tries are optimized for:
  * fast lookup
  * low allocation
  * predictable latency
## Best practices
### Use offline compilation
* compile dictionaries during build or deployment
* do not compile on application startup
### Version your artifacts
* treat `.radixor.gz` files as versioned assets
* store them alongside application releases
### Choose reduction mode deliberately
* use **ranked** for correctness
* use **dominant** only if you fully understand the trade-offs
### Keep dictionaries clean
* better input → better compiled output
* avoid noise and inconsistencies
## Integration tips
* store compiled files under `resources/` or a dedicated directory
* load them once and reuse the trie instance
* avoid repeated loading in frequently executed code paths (for example, per-request processing)
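
A small memoizing holder is one way to follow the "load once and reuse" advice. The sketch below is generic and uses only the JDK; `LoadOnceSketch` is a hypothetical helper, and the string loader in `main` is a stand-in for an actual trie-loading call.

```java
import java.util.function.Supplier;

final class LoadOnceSketch<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value;

    LoadOnceSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    /**
     * Returns the cached value, invoking the loader at most once as long as
     * the loader returns a non-null result.
     */
    @Override
    public T get() {
        T local = value;
        if (local == null) {
            synchronized (this) {
                local = value;
                if (local == null) {
                    local = loader.get();
                    value = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Stand-in loader; a real application would load the compiled trie here.
        LoadOnceSketch<String> holder = new LoadOnceSketch<>(() -> {
            System.out.println("loading...");
            return "compiled trie placeholder";
        });
        holder.get();
        holder.get(); // served from the cached value, the loader is not called again
    }
}
```

Request handlers can then share one holder instance and call `get()` freely without triggering repeated deserialization.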
## Next steps
* [Dictionary format](dictionary-format.md)
* [Programmatic usage](programmatic-usage.md)
* [Quick start](quick-start.md)
## Summary
The `Compile` CLI is the bridge between:
* human-readable dictionary data
* optimized runtime stemmer tables
It enables a clean separation between:
* data preparation
* runtime execution
and is the preferred way to prepare Radixor for production use.