CLI Compilation
Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.
This is the recommended workflow for deployment environments, as it separates:
- dictionary preparation (offline)
- stemming execution (runtime)
Overview
The Compile tool:
- reads a line-oriented dictionary file
- converts word–stem pairs into patch commands
- builds a trie structure
- applies subtree reduction
- writes a compressed binary artifact
The output is a .radixor.gz file suitable for fast runtime loading.
Basic usage
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original \
--overwrite
Required arguments
--input
Path to the source dictionary file.
- must be in the dictionary format
- must be readable
- UTF-8 encoding is expected
--input ./data/stemmer.txt
--output
Path to the output binary file.
- parent directories are created automatically
- output is written as GZip-compressed binary
--output ./build/english.radixor.gz
Optional arguments
--reduction-mode
Controls how aggressively the trie is reduced during compilation.
Available values:
- MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
- MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
- MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
Example:
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
Recommendation
Use:
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
This provides:
- safe behavior
- deterministic ordering
- good compression
--store-original
Stores each stem in the compiled trie as a no-op mapping, so that the stem resolves to itself.
--store-original
Effect:
- ensures that canonical forms are always resolvable
- improves robustness in real-world inputs
Recommended for most use cases.
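The effect of this flag can be illustrated with a small, self-contained sketch. It does not use the Radixor API; the class `StoreOriginalSketch` and the method `expand` are hypothetical, and it assumes that the first token on a dictionary line is the stem and the remaining tokens are its variants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StoreOriginalSketch {

    /**
     * Expands one dictionary line into word -> stem pairs.
     * Assumption (not confirmed by the tool itself): the first token is the
     * stem and every following token is a variant of that stem.
     */
    static Map<String, String> expand(String line, boolean storeOriginal) {
        Map<String, String> pairs = new LinkedHashMap<>();
        if (line.isBlank()) {
            return pairs; // nothing to expand on an empty line
        }
        String[] tokens = line.trim().split("\\s+");
        String stem = tokens[0];
        if (storeOriginal) {
            pairs.put(stem, stem); // the no-op mapping: the stem resolves to itself
        }
        for (int i = 1; i < tokens.length; i++) {
            pairs.put(tokens[i], stem);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Without the flag, "run" itself has no entry.
        System.out.println(expand("run running runs ran", false));
        // With the flag, "run" maps to "run".
        System.out.println(expand("run running runs ran", true));
    }
}
```

In this model, a lookup for the canonical form `run` only finds an entry when the no-op mapping was stored, which is why the flag helps with inputs that are already in stem form.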
--overwrite
Allows overwriting an existing output file.
--overwrite
Without this flag:
- compilation fails if the output file already exists
Reduction strategy explained
Reduction merges semantically equivalent subtrees to reduce memory and file size.
Trade-offs:
| Mode | Compression | Behavioral fidelity |
|---|---|---|
| Ranked | Medium | High |
| Unordered | High | Medium |
| Dominant | Highest | Lower (heuristic) |
Ranked (recommended)
- preserves full getAll() ordering
- safest and most predictable
Unordered
- ignores ordering differences
- higher compression, but less precise semantics
Dominant
- focuses on the most frequent result
- useful when only get() is relevant
- may lose secondary candidates
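To make the difference between the three modes concrete, the following sketch models each node's result as a list of candidates ordered from most to least frequent, and shows when two such lists would count as equivalent. This is a simplified conceptual model, not Radixor's implementation: the real reduction compares whole subtrees, and the class and method names here are hypothetical.

```java
import java.util.HashSet;
import java.util.List;

public class ReductionEquivalenceSketch {

    /** Ranked: the same candidates in the same order. */
    static boolean rankedEquivalent(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /** Unordered: the same candidates, order ignored. */
    static boolean unorderedEquivalent(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    /** Dominant: only the top candidate has to match. */
    static boolean dominantEquivalent(List<String> a, List<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty();
        }
        return a.get(0).equals(b.get(0));
    }

    public static void main(String[] args) {
        List<String> first = List.of("patchA", "patchB");
        List<String> reordered = List.of("patchB", "patchA");
        List<String> sameTop = List.of("patchA", "patchC");

        System.out.println(rankedEquivalent(first, reordered));    // false
        System.out.println(unorderedEquivalent(first, reordered)); // true
        System.out.println(dominantEquivalent(first, sameTop));    // true
        System.out.println(unorderedEquivalent(first, sameTop));   // false
    }
}
```

Under this model, the unordered and dominant checks each accept pairs that the ranked check rejects, so more subtrees can be merged, at the cost of the ordering or the secondary candidates that the merged nodes originally had.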
Output format
The compiled file:
- is a binary representation of the trie
- uses GZip compression
- is optimized for:
  - fast loading
  - minimal memory footprint
Typical properties:
- small file size
- fast deserialization
- no runtime preprocessing required
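Because the artifact is GZip-compressed, a build step can run a basic sanity check on it with the JDK alone, without interpreting the binary trie format. The sketch below only verifies that a file is a readable GZip stream and reports its decompressed size. The class name `GzipCheckSketch` is hypothetical, and the demo in `main` writes a stand-in file rather than a real compiled dictionary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheckSketch {

    /**
     * Returns the decompressed size in bytes.
     * Throws an IOException if the file is not a valid GZip stream.
     */
    static long decompressedSize(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in artifact: 1024 zero bytes, GZip-compressed.
        // A real check would point at the compiled .radixor.gz file instead.
        Path artifact = Files.createTempFile("sketch", ".radixor.gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(artifact))) {
            out.write(new byte[1024]);
        }
        System.out.println("compressed bytes:   " + Files.size(artifact));
        System.out.println("decompressed bytes: " + decompressedSize(artifact));
        Files.delete(artifact);
    }
}
```

A check like this catches truncated or corrupted uploads early, but it says nothing about whether the trie content itself is valid.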
Example workflow
1. Prepare dictionary
run running runs ran
connect connected connecting
2. Compile
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original
3. Use in application
FrequencyTrie<String> trie =
StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
Error handling
The CLI reports:
- missing input file
- invalid arguments
- I/O failures
- parsing errors
Typical exit codes:
- 0 – success
- non-zero – failure
Error details are printed to standard error.
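When the compiler is invoked from a Java-based build step, the exit code can be checked directly. The following is a minimal sketch using the JDK's `ProcessBuilder`. The class `CompileRunnerSketch` and its methods are hypothetical; the flags are the ones documented above, and the sketch assumes that `java` is on the PATH and that the Radixor classes are reachable through the classpath (for example via the `CLASSPATH` environment variable).

```java
import java.io.IOException;
import java.util.List;

public class CompileRunnerSketch {

    /** Builds the command line shown in "Basic usage" for the given paths. */
    static List<String> compileCommand(String input, String output) {
        return List.of(
                "java", "org.egothor.stemmer.Compile",
                "--input", input,
                "--output", output,
                "--reduction-mode", "MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS",
                "--store-original",
                "--overwrite");
    }

    /** Runs the command, forwards its output and error streams, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // error details from the CLI appear on this process's stderr
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        int exitCode = run(compileCommand("./data/stemmer.txt", "./build/english.radixor.gz"));
        if (exitCode != 0) {
            // Fail the build step rather than continue with a missing or stale artifact.
            throw new IllegalStateException("Dictionary compilation failed, exit code " + exitCode);
        }
        System.out.println("Dictionary compiled successfully.");
    }
}
```

Treating any non-zero exit code as a hard failure keeps a broken dictionary from being packaged silently.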
Performance considerations
Compilation
- typically CPU-bound
- depends on dictionary size and reduction mode
Output size
- depends on:
  - dictionary completeness
  - reduction strategy
- can vary significantly between modes
Runtime impact
- compiled tries are optimized for:
  - fast lookup
  - low allocation
  - predictable latency
Best practices
Use offline compilation
- compile dictionaries during build or deployment
- do not compile on application startup
Version your artifacts
- treat .radixor.gz files as versioned assets
- store them alongside application releases
Choose reduction mode deliberately
- use ranked for correctness
- use dominant only if you fully understand the trade-offs
Keep dictionaries clean
- better input → better compiled output
- avoid noise and inconsistencies
Integration tips
- store compiled files under resources/ or a dedicated directory
- load them once and reuse the trie instance
- avoid repeated loading in frequently executed code paths (for example, per-request processing)
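One way to follow the "load once and reuse" advice is a small lazy holder that runs the loader on first access and then hands out the same instance. The sketch below is generic and does not call the Radixor API; the class `LoadOnceSketch` and its nested `Lazy` type are hypothetical, and the string returned by the loader stands in for the loaded trie.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LoadOnceSketch {

    /**
     * Thread-safe wrapper that invokes the loader at most once
     * (assuming the loader returns a non-null value).
     */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> loader;
        private volatile T value;

        Lazy(Supplier<T> loader) {
            this.loader = loader;
        }

        @Override
        public T get() {
            T result = value;
            if (result == null) {
                synchronized (this) {
                    result = value;
                    if (result == null) {
                        result = loader.get(); // the expensive load happens here, once
                        value = result;
                    }
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // In an application, the supplier would load the compiled dictionary instead.
        Lazy<String> trie = new Lazy<>(() -> {
            loads.incrementAndGet();
            return "stand-in for the loaded trie";
        });

        for (int i = 0; i < 1000; i++) {
            trie.get(); // simulates per-request access
        }
        System.out.println("loads = " + loads.get()); // prints "loads = 1"
    }
}
```

In frameworks with dependency injection, registering the trie as a singleton-scoped component achieves the same result without a hand-written holder.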
Summary
The Compile CLI is the bridge between:
- human-readable dictionary data
- optimized runtime stemmer tables
It enables a clean separation between:
- data preparation
- runtime execution
and is the preferred way to prepare Radixor for production use.