Radixor/docs/cli-compilation.md
Leo Galambos 041b7f43fb Practical improvements
fix: cli-compilation doc is missing some params
chore: ExperimentCli is not relevant for JaCoCo
feat: human-readable format of trie metadata
fix: some new JUnit-s added
2026-04-23 23:43:25 +02:00


# CLI Compilation
Radixor provides a command-line compiler for turning line-oriented dictionary files into compact binary stemmer artifacts.
This is the preferred preparation workflow when stemming should run against an already compiled artifact rather than against raw dictionary input. The CLI reads the dictionary, derives patch commands, builds a mutable trie, applies the selected subtree reduction strategy, and writes the final compiled trie in the project binary format under GZip compression. The result is a deployment-ready `.radixor.gz` file that can be loaded directly by application code.
## What the CLI does
The `Compile` tool performs the following steps:
1. reads the input dictionary in the standard Radixor stemmer format,
2. parses each line into a canonical stem column and its known variant columns,
3. converts variants into patch commands,
4. builds a mutable trie of patch-command values,
5. applies the configured reduction mode,
6. writes the compiled trie as a GZip-compressed binary artifact.
This workflow is intentionally aligned with the same dictionary semantics used elsewhere in the library. Remarks introduced by `#` or `//` are supported through the shared dictionary parser.
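
The idea behind step 3 can be illustrated with a deliberately simplified model. The sketch below is not Radixor's actual patch-command encoding: the `Patch` record and the `derive` method are hypothetical names, and the patch is assumed to be a plain "strip N trailing characters, then append a suffix" rule. It only shows how a variant can be mapped back to its stem by a small, reusable transformation, and how a stem mapped to itself yields a no-op.

```java
public final class SuffixPatchSketch {

    /** Hypothetical patch: drop {@code strip} trailing characters, then append {@code append}. */
    public record Patch(int strip, String append) {

        /** Applies this patch to a word form and returns the resulting stem. */
        public String apply(String word) {
            return word.substring(0, word.length() - strip) + append;
        }
    }

    /** Derives the patch that rewrites {@code variant} into {@code stem}. */
    public static Patch derive(String variant, String stem) {
        // Length of the shared prefix of both forms.
        int common = 0;
        int limit = Math.min(variant.length(), stem.length());
        while (common < limit && variant.charAt(common) == stem.charAt(common)) {
            common++;
        }
        // Everything after the shared prefix is stripped and replaced.
        return new Patch(variant.length() - common, stem.substring(common));
    }

    public static void main(String[] args) {
        Patch running = derive("running", "run");
        System.out.println(running + " -> " + running.apply("running"));
        // A stem mapped to itself needs no change at all.
        Patch noOp = derive("run", "run");
        System.out.println(noOp + " -> " + noOp.apply("run"));
    }
}
```

Because many variants share the same transformation, a trie whose values are patches rather than full stems tends to contain many identical subtrees, which is what the reduction step can exploit.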
## Basic usage
```bash
java org.egothor.stemmer.Compile \
--input ./data/stemmer.tsv \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--case-processing-mode LOWERCASE_WITH_LOCALE_ROOT \
--store-original \
--overwrite
```
## Supported arguments
The CLI supports the following arguments:
```text
--input <file>
--output <file>
--reduction-mode <mode>
[--store-original]
[--right-to-left]
[--case-processing-mode <mode>]
[--dominant-winner-min-percent <1..100>]
[--dominant-winner-over-second-ratio <1..n>]
[--overwrite]
[--help]
```
### `--input <file>`
Path to the source dictionary file.
The file must use the standard line-oriented, tab-separated dictionary format: columns are separated by the tab character, and each non-empty logical line starts with the canonical stem column followed by zero or more variant columns. The parser expects UTF-8 input, applies case processing according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), and ignores trailing remarks introduced by `#` or `//`. Dictionary items that contain embedded whitespace are currently skipped and reported through warning-level log entries.
Example:
```text
--input ./data/stemmer.tsv
```
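
To make the parsing rules above concrete, here is a minimal sketch of how a single dictionary line could be processed. It is not the shared Radixor parser: the class and method names are hypothetical, remark handling is reduced to "cut at the first `#` or `//`", and the sketch does not treat a skipped stem column specially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public final class DictionaryLineSketch {

    /** Returns the columns of one line: index 0 is the stem, the rest are variants. */
    public static List<String> parse(String line) {
        String content = stripRemark(line);
        List<String> columns = new ArrayList<>();
        for (String column : content.split("\t")) {
            String item = column.strip();
            if (item.isEmpty()) {
                continue;
            }
            // Items with embedded whitespace are skipped and reported as warnings.
            if (item.chars().anyMatch(Character::isWhitespace)) {
                System.err.println("WARN: skipping item with embedded whitespace: " + item);
                continue;
            }
            // Mirrors the LOWERCASE_WITH_LOCALE_ROOT default described above.
            columns.add(item.toLowerCase(Locale.ROOT));
        }
        return columns;
    }

    /** Cuts the line at the first remark marker, if any. */
    private static String stripRemark(String line) {
        int end = line.length();
        for (String marker : new String[] {"#", "//"}) {
            int at = line.indexOf(marker);
            if (at >= 0 && at < end) {
                end = at;
            }
        }
        return line.substring(0, end);
    }
}
```

A line that contains only a remark produces an empty list, which a caller would treat as "nothing to insert".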
### `--output <file>`
Path to the output binary artifact.
The output file is written as a GZip-compressed binary trie. Parent directories are created automatically when needed.
Example:
```text
--output ./build/english.radixor.gz
```
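
The output behavior described above (GZip compression plus automatic creation of parent directories) can be reproduced with the standard library alone. The following is a sketch under that assumption, not the CLI's actual writer: the class name is hypothetical and the payload is an arbitrary byte array standing in for the serialized trie.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class GzipArtifactSketch {

    /** Writes the payload as a GZip file, creating missing parent directories first. */
    public static void write(Path output, byte[] payload) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(output))) {
            out.write(payload);
        }
    }

    /** Reads the file back and returns the decompressed bytes. */
    public static byte[] read(Path input) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(input))) {
            return in.readAllBytes();
        }
    }
}
```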
### `--reduction-mode <mode>`
Selects the subtree reduction strategy used during compilation.
Supported values are:
- `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
- `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
- `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
Example:
```text
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
This argument is required.
### `--store-original`
When this flag is present, the canonical stem itself is inserted using the no-op patch command.
```text
--store-original
```
This is usually a sensible default for real dictionaries because it ensures that canonical forms are directly representable in the compiled trie rather than relying only on their variants.
### `--right-to-left`
When present, compilation uses forward traversal (`WordTraversalDirection.FORWARD`) so stored forms are processed from their logical beginning.
```text
--right-to-left
```
This option is intended for right-to-left languages where affix behavior should operate on the written form without externally reversing words.
### `--case-processing-mode <mode>`
Controls dictionary key normalization during compilation and lookup.
Supported values are:
- `LOWERCASE_WITH_LOCALE_ROOT` (default)
- `AS_IS`
Example:
```text
--case-processing-mode AS_IS
```
### `--dominant-winner-min-percent <1..100>`
Sets the minimum winner percentage used by dominant-result reduction settings.
Example:
```text
--dominant-winner-min-percent 75
```
This option matters primarily when `--reduction-mode` is `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`. The default value is `75`.
### `--dominant-winner-over-second-ratio <1..n>`
Sets the minimum winner-over-second ratio used by dominant-result reduction settings.
Example:
```text
--dominant-winner-over-second-ratio 3
```
This option also matters primarily for `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`. The default value is `3`.
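
One plausible reading of how the two thresholds combine is sketched below. The exact semantics are an assumption, not taken from the Radixor sources: the winner is treated as dominant when its share of all observations reaches the minimum percentage and its count is at least the configured multiple of the runner-up's count. The class and method names are hypothetical.

```java
public final class DominanceSketch {

    /**
     * Returns true when the most frequent result is dominant under both thresholds.
     * Assumed semantics: winner share of total >= minPercent, and
     * winner count >= overSecondRatio * second count.
     */
    public static boolean isDominant(int winner, int second, int total,
                                     int minPercent, int overSecondRatio) {
        if (total <= 0 || winner <= 0) {
            return false;
        }
        // Integer arithmetic in long avoids rounding and overflow issues.
        boolean shareOk = (long) winner * 100 >= (long) minPercent * total;
        boolean ratioOk = (long) winner >= (long) overSecondRatio * second;
        return shareOk && ratioOk;
    }
}
```

With the documented defaults of `75` and `3`, a node whose results occur 80 and 20 times would count as dominant under this reading, while a 60/40 split would not.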
### `--overwrite`
Allows the CLI to replace an already existing output file.
```text
--overwrite
```
Without this flag, compilation fails when the output path already exists.
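
A common way to get this behavior in Java is to let the file system enforce it through open options. The sketch below assumes that approach; it is not necessarily how `Compile` implements the check, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class OverwriteGuardSketch {

    /**
     * Opens the output for writing. Without overwrite, an existing file causes
     * a FileAlreadyExistsException instead of being silently replaced.
     */
    public static OutputStream open(Path output, boolean overwrite) throws IOException {
        if (overwrite) {
            return Files.newOutputStream(output,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        }
        // CREATE_NEW fails atomically when the path already exists.
        return Files.newOutputStream(output,
                StandardOpenOption.CREATE_NEW,
                StandardOpenOption.WRITE);
    }
}
```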
### `--help`
Prints usage help and exits successfully.
```text
--help
```
The short form `-h` is also supported.
## Reduction modes in practice
Reduction mode is not only a storage decision. It also influences what semantics are preserved when the mutable trie is compiled into its canonical read-only form.
### Ranked `getAll()` equivalence
`MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS` merges subtrees whose `getAll()` results remain equivalent for every reachable key suffix and whose local result ordering is the same.
This is the best general-purpose choice when result ordering and ambiguity handling matter. It preserves ranked multi-result semantics while still achieving useful structural reduction.
This is the recommended default for most users.
### Unordered `getAll()` equivalence
`MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS` also uses `getAll()`-level equivalence, but it ignores differences in local result ordering as well as absolute frequencies.
This can yield stronger reduction, but it also weakens the precision of ordered multi-result semantics.
Choose this mode only when the application does not depend on the ordering of alternative results.
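
The difference between the ranked and unordered notions of equivalence can be shown on plain result lists. This sketch compares only two local result lists, not whole subtrees, and its class and method names are hypothetical rather than part of the Radixor API.

```java
import java.util.HashSet;
import java.util.List;

public final class ResultEquivalenceSketch {

    /** Ranked equivalence: the same values in the same order. */
    public static boolean rankedEquivalent(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /** Unordered equivalence: the same values, with order ignored. */
    public static boolean unorderedEquivalent(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }
}
```

Two nodes returning the same patches in a different order are equivalent only under the unordered comparison, which is why that mode can merge more subtrees but loses ordering information.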
### Dominant `get()` equivalence
`MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS` focuses on preserving preferred-result semantics for `get()`, subject to dominance thresholds.
If a node does not satisfy the configured dominance constraints, compilation falls back to ranked `getAll()` semantics for that node to avoid unsafe over-reduction.
This mode is most suitable when the application primarily consumes the preferred result and does not rely on preserving richer ambiguity information.
## Recommended usage patterns
### Use offline preparation
The CLI is best used as a preparation step during packaging, deployment, or controlled artifact generation. This keeps compilation outside the runtime startup path and allows services to load only the finished binary trie.
### Treat compiled files as versioned assets
A `.radixor.gz` file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds.
Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes traversal direction, RTL indicator, reduction mode, case-processing mode, and dominant thresholds. After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains.
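
Once the metadata block has been extracted as text, reading it is straightforward. The sketch below only parses `key=value` lines into an ordered map. How the block is located inside the decompressed artifact is not covered here, and the key names used in the usage note are hypothetical placeholders, not the names Radixor actually writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class MetadataBlockSketch {

    /** Parses key=value lines into an insertion-ordered map; other lines are ignored. */
    public static Map<String, String> parse(String block) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : block.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                entries.put(line.substring(0, eq).strip(), line.substring(eq + 1).strip());
            }
        }
        return entries;
    }
}
```

For example, `MetadataBlockSketch.parse("reductionMode=X\ncaseProcessingMode=AS_IS")` yields a map with two entries in the order they appeared.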
### Choose reduction mode deliberately
The ranked `getAll()` mode is the safest default. The unordered and dominant modes should be chosen only when their trade-offs are acceptable for the consuming application.
### Expect memory pressure during preparation, not runtime
Compilation is usually a one-time step and is generally fast. The more important operational consideration is memory usage during preparation, because the dictionary-derived mutable structure exists before reduction compacts it into the final read-only trie. This is especially relevant for very large source dictionaries.
## Example workflow
### 1. Prepare a dictionary
```text
run	running	runs	ran
connect	connected	connecting
```
### 2. Compile it
```bash
java org.egothor.stemmer.Compile \
--input ./data/stemmer.tsv \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original
```
### 3. Load it in an application
```java
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.StemmerPatchTrieLoader;

final FrequencyTrie<String> trie =
        StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```
## Exit codes and error handling
The CLI uses three exit codes:
- `0` for success,
- `1` for processing failures such as I/O or compilation errors,
- `2` for invalid command-line usage.
When argument parsing fails, the CLI prints the error message, prints the usage summary, and exits with status `2`.
When compilation fails during processing, the CLI prints a `Compilation failed: ...` message to standard error and exits with status `1`.
Examples of failure conditions include:
- missing required arguments,
- unknown arguments,
- invalid integer values for dominant thresholds,
- missing input files,
- unreadable input,
- existing output file without `--overwrite`,
- general I/O failures during reading or writing.
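
The mapping from failure kinds to exit codes can be sketched as follows. This is an illustration of the pattern, not the code of `Compile`: the class name is hypothetical, and using `IllegalArgumentException` to signal a usage error is an assumption made for the example.

```java
import java.util.concurrent.Callable;

public final class ExitCodeSketch {

    static final int SUCCESS = 0;
    static final int PROCESSING_ERROR = 1;
    static final int USAGE_ERROR = 2;

    /** Runs the task and translates its outcome into one of the three exit codes. */
    public static int run(Callable<Void> task) {
        try {
            task.call();
            return SUCCESS;
        } catch (IllegalArgumentException e) {
            // Assumed convention: invalid arguments surface as IllegalArgumentException.
            System.err.println(e.getMessage());
            System.err.println("usage: Compile --input <file> --output <file> --reduction-mode <mode>");
            return USAGE_ERROR;
        } catch (Exception e) {
            System.err.println("Compilation failed: " + e.getMessage());
            return PROCESSING_ERROR;
        }
    }
}
```

A `main` method would typically end with `System.exit(run(...))`, so that build scripts and CI pipelines can distinguish a malformed invocation from a genuine processing failure.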
## Relation to programmatic usage
The CLI and the programmatic API implement the same conceptual preparation step. The CLI is the operationally convenient choice when you want a ready-made binary artifact. The programmatic API is the better fit when compilation must be integrated directly into custom Java workflows.
## Next steps
- [Dictionary format](dictionary-format.md)
- [Quick start](quick-start.md)
- [Programmatic usage](programmatic-usage.md)
- [Architecture and reduction](architecture-and-reduction.md)