# CLI Compilation
Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.
This is the recommended workflow for deployment environments, as it separates:
- dictionary preparation (offline)
- stemming execution (runtime)
## Overview
The `Compile` tool:
1. reads a line-oriented dictionary file
2. converts word-stem pairs into patch commands
3. builds a trie structure
4. applies subtree reduction
5. writes a compressed binary artifact
The output is a `.radixor.gz` file suitable for fast runtime loading.
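
To make step 2 more concrete, the sketch below derives a suffix patch for a word-stem pair: how many trailing characters to drop and what to append. This is a deliberately simplified stand-in for illustration. Radixor's actual patch-command encoding is not described on this page, and the names `SuffixPatchSketch`, `Patch`, `diff`, and `apply` are hypothetical.

```java
/** Illustrative only: a simplified suffix patch, not Radixor's real patch-command encoding. */
final class SuffixPatchSketch {

    /** Hypothetical patch: drop {@code dropCount} trailing characters, then append {@code suffix}. */
    record Patch(int dropCount, String suffix) {}

    /** Derives the patch that turns {@code word} into {@code stem} by rewriting only the tail. */
    static Patch diff(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        // Length of the shared prefix; everything after it has to be rewritten.
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return new Patch(word.length() - common, stem.substring(common));
    }

    /** Applies a patch to a word and returns the resulting stem. */
    static String apply(String word, Patch patch) {
        return word.substring(0, word.length() - patch.dropCount()) + patch.suffix();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"running", "runs", "ran"}) {
            Patch patch = diff(word, "run");
            System.out.println(word + " + " + patch + " -> " + apply(word, patch));
        }
    }
}
```

In this simplified model, a patch describes a transformation rather than a specific word, so unrelated words with the same ending can share one patch value.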
## Basic usage
```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original \
  --overwrite
```
## Required arguments
### `--input`
Path to the source dictionary file.
* must be in the [dictionary format](dictionary-format.md)
* must be readable
* UTF-8 encoding is expected
```
--input ./data/stemmer.txt
```
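
Because the input must be readable and UTF-8 encoded, a build step can check both before invoking the compiler. The following is a minimal sketch using only the JDK; `InputPreflight` and its methods are hypothetical helpers, not part of Radixor, and the check says nothing about whether the content follows the dictionary format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-flight check for the --input file; not part of Radixor. */
final class InputPreflight {

    /** Returns true if the bytes decode as strict UTF-8 (malformed input is reported, not replaced). */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if the file is readable and its content is valid UTF-8. */
    static boolean isUsableInput(Path input) throws IOException {
        return Files.isReadable(input) && isValidUtf8(Files.readAllBytes(input));
    }

    public static void main(String[] args) throws IOException {
        Path input = Path.of(args.length > 0 ? args[0] : "./data/stemmer.txt");
        System.out.println(input + " usable: " + isUsableInput(input));
    }
}
```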
### `--output`
Path to the output binary file.
* parent directories are created automatically
* output is written as **GZip-compressed binary**
```
--output ./build/english.radixor.gz
```
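
Since the output is GZip-compressed, a quick sanity check on a build artifact is to look for the GZip magic number (`0x1f 0x8b`) at the start of the file. The sketch below does only that; `GzipCheck` is a hypothetical helper, and a positive result confirms the container format, not that the payload is a valid Radixor table.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Hypothetical artifact check; not part of Radixor. */
final class GzipCheck {

    /** Returns true if the first two bytes are the GZip magic number. */
    static boolean hasGzipMagic(byte[] header) {
        if (header.length < 2) {
            return false;
        }
        // GZIP_MAGIC is 0x8b1f: the two header bytes read as a little-endian 16-bit value.
        int magic = (header[0] & 0xff) | ((header[1] & 0xff) << 8);
        return magic == GZIPInputStream.GZIP_MAGIC;
    }

    /** Reads the first two bytes of the file and checks them. */
    static boolean isGzipFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasGzipMagic(in.readNBytes(2));
        }
    }

    public static void main(String[] args) throws IOException {
        Path artifact = Path.of(args.length > 0 ? args[0] : "./build/english.radixor.gz");
        System.out.println(artifact + " looks like GZip: " + isGzipFile(artifact));
    }
}
```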
## Optional arguments
### `--reduction-mode`
Controls how aggressively the trie is reduced during compilation.
Available values:
* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
Example:
```
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
#### Recommendation
Use:
```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
This provides:
* safe behavior
* deterministic ordering
* good compression
### `--store-original`
Stores the stem itself as a no-op mapping.
```
--store-original
```
Effect:
* ensures that canonical forms are always resolvable
* improves robustness on real-world inputs
Recommended for most use cases.
### `--overwrite`
Allows overwriting an existing output file.
```
--overwrite
```
Without this flag:
* compilation fails if the output file already exists
## Reduction strategy explained
Reduction merges semantically equivalent subtrees to reduce memory and file size.
Trade-offs:
| Mode | Compression | Behavioral fidelity |
| --------- | ----------- | ------------------- |
| Ranked | Medium | High |
| Unordered | High | Medium |
| Dominant | Highest | Lower (heuristic) |
### Ranked (recommended)
* preserves full `getAll()` ordering
* safest and most predictable
### Unordered
* ignores ordering differences
* higher compression, but less precise semantics
### Dominant
* focuses on the most frequent result
* useful when only `get()` is relevant
* may lose secondary candidates
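
Taken together, one way to build intuition for the three modes is to ask when two candidate lists would count as "the same". The sketch below is a conceptual illustration based only on the descriptions in this section, not Radixor's reduction code. `ReductionIntuition` and its methods are hypothetical, and each list stands for the results stored at one node, ordered from most to least frequent.

```java
import java.util.HashSet;
import java.util.List;

/** Conceptual illustration only; Radixor's real reduction logic is not shown on this page. */
final class ReductionIntuition {

    /** Ranked: the same candidates in the same order. */
    static boolean rankedEquivalent(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /** Unordered: the same candidates, order ignored. */
    static boolean unorderedEquivalent(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    /** Dominant: the same top candidate; secondary candidates are ignored. */
    static boolean dominantEquivalent(List<String> a, List<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty();
        }
        return a.get(0).equals(b.get(0));
    }

    public static void main(String[] args) {
        List<String> x = List.of("patchA", "patchB");
        List<String> y = List.of("patchB", "patchA"); // same candidates, different order
        List<String> z = List.of("patchA", "patchC"); // same top candidate, different tail

        System.out.println("ranked(x, y)    = " + rankedEquivalent(x, y));
        System.out.println("unordered(x, y) = " + unorderedEquivalent(x, y));
        System.out.println("ranked(x, z)    = " + rankedEquivalent(x, z));
        System.out.println("dominant(x, z)  = " + dominantEquivalent(x, z));
    }
}
```

A looser notion of equivalence lets more subtrees merge, which is the intuition behind compression rising as behavioral fidelity falls.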
## Output format
The compiled file:
* is a binary representation of the trie
* uses **GZip compression**
* is optimized for:
  * fast loading
  * minimal memory footprint
Typical properties:
* small file size
* fast deserialization
* no runtime preprocessing required
## Example workflow
### 1. Prepare dictionary
```
run running runs ran
connect connected connecting
```
### 2. Compile
```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original
```
### 3. Use in application
```java
FrequencyTrie<String> trie =
    StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```
## Error handling
The CLI reports:
* missing input file
* invalid arguments
* I/O failures
* parsing errors
Typical exit codes:
* `0`: success
* non-zero: failure
Error details are printed to standard error.
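
A build script that wraps the CLI can rely on this convention: treat any exit code other than `0` as a failure and surface whatever was written to standard error. The sketch below uses the JDK's `ProcessBuilder`; `CompileRunner` is a hypothetical wrapper, and the command it receives is whatever invocation your build uses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical wrapper around a CLI invocation; not part of Radixor. */
final class CompileRunner {

    /** Runs the command and throws if it exits with a non-zero code. */
    static void runOrFail(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectOutput(ProcessBuilder.Redirect.INHERIT) // pass stdout through
                .start();
        // Drain stderr before waiting so the child cannot block on a full pipe.
        String stderr = new String(process.getErrorStream().readAllBytes(), StandardCharsets.UTF_8);
        checkExit(process.waitFor(), stderr);
    }

    /** Applies the exit-code convention: 0 is success, anything else is failure. */
    static void checkExit(int exitCode, String stderr) {
        if (exitCode != 0) {
            throw new IllegalStateException(
                    "Compile failed (exit " + exitCode + "): " + stderr.strip());
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length == 0) {
            System.err.println("usage: CompileRunner <command> [args...]");
            return;
        }
        runOrFail(List.of(args));
        System.out.println("compilation succeeded");
    }
}
```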
## Performance considerations
### Compilation
* typically CPU-bound
* depends on dictionary size and reduction mode
### Output size
* depends on:
  * dictionary completeness
  * reduction strategy
* can vary significantly between modes
### Runtime impact
* compiled tries are optimized for:
  * fast lookup
  * low allocation
  * predictable latency
## Best practices
### Use offline compilation
* compile dictionaries during build or deployment
* do not compile on application startup
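
For builds driven from Java, the compile step can be assembled as an argument list and handed to whatever process runner the build already uses. The sketch below uses only the class name and flags documented on this page. `CompileCommand` is a hypothetical helper, and the `-cp` value is an assumption: it stands for wherever the Radixor classes live in your build.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles the documented CLI invocation; not part of Radixor. */
final class CompileCommand {

    /** Builds the argument list for an offline compile using the recommended reduction mode. */
    static List<String> build(String classpath, Path input, Path output, boolean overwrite) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath); // assumption: location of the Radixor classes in your build
        cmd.add("org.egothor.stemmer.Compile");
        cmd.add("--input");
        cmd.add(input.toString());
        cmd.add("--output");
        cmd.add(output.toString());
        cmd.add("--reduction-mode");
        cmd.add("MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS");
        cmd.add("--store-original");
        if (overwrite) {
            cmd.add("--overwrite"); // without it, an existing output file makes compilation fail
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "radixor.jar", // hypothetical jar name
                Path.of("data", "stemmer.txt"),
                Path.of("build", "english.radixor.gz"),
                true);
        System.out.println(String.join(" ", cmd));
    }
}
```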
### Version your artifacts
* treat `.radixor.gz` files as versioned assets
* store them alongside application releases
### Choose reduction mode deliberately
* use **ranked** for correctness
* use **dominant** only if you fully understand the trade-offs
### Keep dictionaries clean
* better input → better compiled output
* avoid noise and inconsistencies
## Integration tips
* store compiled files under `resources/` or a dedicated directory
* load them once and reuse the trie instance
* avoid repeated loading in frequently executed code paths (for example, per-request processing)
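
Loading once and reusing the instance can be done with a small memoizing holder. The sketch below is generic so that it assumes nothing about Radixor's loader API beyond the `StemmerPatchTrieLoader.loadBinary(...)` call shown earlier, which an application would wrap in the supplier. `LoadOnce` is a hypothetical helper, and the string in `main` is only a stand-in for the trie.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical thread-safe holder that runs an expensive loader at most once. */
final class LoadOnce<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value; // volatile makes the double-checked locking below safe

    LoadOnce(Supplier<T> loader) {
        this.loader = Objects.requireNonNull(loader);
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    // The loader must not return null, or the value would never be cached.
                    result = Objects.requireNonNull(loader.get());
                    value = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LoadOnce<String> trie = new LoadOnce<>(() -> {
            System.out.println("loading...");
            return "stand-in for the compiled trie";
        });
        trie.get(); // triggers the load
        trie.get(); // served from the cached instance
    }
}
```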
## Next steps
* [Dictionary format](dictionary-format.md)
* [Programmatic usage](programmatic-usage.md)
* [Quick start](quick-start.md)
## Summary
The `Compile` CLI is the bridge between:
* human-readable dictionary data
* optimized runtime stemmer tables
It enables a clean separation between:
* data preparation
* runtime execution
It is the preferred way to prepare Radixor for production use.