docs: improve README, MkDocs content, branding assets, and site polish

This commit is contained in:
2026-04-19 00:18:42 +02:00
parent db79dd2d4f
commit 0b674a39a8
19 changed files with 1836 additions and 1698 deletions

View File

@@ -1,303 +1,248 @@
# CLI Compilation
Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.
Radixor provides a command-line compiler for turning line-oriented dictionary files into compact binary stemmer artifacts.
This is the recommended workflow for deployment environments, as it separates:
This is the preferred preparation workflow when stemming should run against an already compiled artifact rather than against raw dictionary input. The CLI reads the dictionary, derives patch commands, builds a mutable trie, applies the selected subtree reduction strategy, and writes the final compiled trie in the project binary format under GZip compression. The result is a deployment-ready `.radixor.gz` file that can be loaded directly by application code.
- dictionary preparation (offline)
- stemming execution (runtime)
## What the CLI does
The `Compile` tool performs the following steps:
1. reads the input dictionary in the standard Radixor stemmer format,
2. parses each line into a canonical stem and its known variants,
3. converts variants into patch commands,
4. builds a mutable trie of patch-command values,
5. applies the configured reduction mode,
6. writes the compiled trie as a GZip-compressed binary artifact.
## Overview
The `Compile` tool:
1. reads a line-oriented dictionary file
2. converts wordstem pairs into patch commands
3. builds a trie structure
4. applies subtree reduction
5. writes a compressed binary artifact
The output is a `.radixor.gz` file suitable for fast runtime loading.
This workflow is intentionally aligned with the same dictionary semantics used elsewhere in the library. Remarks introduced by `#` or `//` are supported through the shared dictionary parser.
## Basic usage
```bash
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original \
--overwrite
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original \
--overwrite
```
## Supported arguments
The CLI supports the following arguments:
## Required arguments
```text
--input <file>
--output <file>
--reduction-mode <mode>
[--store-original]
[--dominant-winner-min-percent <1..100>]
[--dominant-winner-over-second-ratio <1..n>]
[--overwrite]
[--help]
```
### `--input`
### `--input <file>`
Path to the source dictionary file.
* must be in the [dictionary format](dictionary-format.md)
* must be readable
* UTF-8 encoding is expected
```
--input ./data/stemmer.txt
```
### `--output`
Path to the output binary file.
* parent directories are created automatically
* output is written as **GZip-compressed binary**
```
--output ./build/english.radixor.gz
```
## Optional arguments
### `--reduction-mode`
Controls how aggressively the trie is reduced during compilation.
Available values:
* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
The file must use the standard line-oriented dictionary format. Each non-empty logical line starts with the canonical stem and may contain zero or more variants. The parser expects UTF-8 input, lowercases it using `Locale.ROOT`, and ignores trailing remarks introduced by `#` or `//`.
Example:
```text
--input ./data/stemmer.txt
```
### `--output <file>`
Path to the output binary artifact.
The output file is written as a GZip-compressed binary trie. Parent directories are created automatically when needed.
Example:
```text
--output ./build/english.radixor.gz
```
### `--reduction-mode <mode>`
Selects the subtree reduction strategy used during compilation.
Supported values are:
- `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
- `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
- `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`
Example:
```text
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
#### Recommendation
Use:
```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```
This provides:
* safe behavior
* deterministic ordering
* good compression
This argument is required.
### `--store-original`
Stores the stem itself as a no-op mapping.
When this flag is present, the canonical stem itself is inserted using the no-op patch command.
```
```text
--store-original
```
Effect:
This is usually a sensible default for real dictionaries because it ensures that canonical forms are directly representable in the compiled trie rather than relying only on their variants.
* ensures that canonical forms are always resolvable
* improves robustness in real-world inputs
### `--dominant-winner-min-percent <1..100>`
Recommended for most use cases.
Sets the minimum winner percentage used by dominant-result reduction settings.
Example:
```text
--dominant-winner-min-percent 75
```
This option matters primarily when `--reduction-mode` is `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`. The default value is `75`.
### `--dominant-winner-over-second-ratio <1..n>`
Sets the minimum winner-over-second ratio used by dominant-result reduction settings.
Example:
```text
--dominant-winner-over-second-ratio 3
```
This option also matters primarily for `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`. The default value is `3`.
### `--overwrite`
Allows overwriting an existing output file.
Allows the CLI to replace an already existing output file.
```
```text
--overwrite
```
Without this flag:
Without this flag, compilation fails when the output path already exists.
* compilation fails if the output file already exists
### `--help`
Prints usage help and exits successfully.
```text
--help
```
## Reduction strategy explained
The short form `-h` is also supported.
Reduction merges semantically equivalent subtrees to reduce memory and file size.
## Reduction modes in practice
Trade-offs:
Reduction mode is not only a storage decision. It also influences what semantics are preserved when the mutable trie is compiled into its canonical read-only form.
| Mode | Compression | Behavioral fidelity |
| --------- | ----------- | ------------------- |
| Ranked | Medium | High |
| Unordered | High | Medium |
| Dominant | Highest | Lower (heuristic) |
### Ranked `getAll()` equivalence
### Ranked (recommended)
`MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS` merges subtrees whose `getAll()` results remain equivalent for every reachable key suffix and whose local result ordering is the same.
* preserves full `getAll()` ordering
* safest and most predictable
This is the best general-purpose choice when result ordering and ambiguity handling matter. It preserves ranked multi-result semantics while still achieving useful structural reduction.
### Unordered
This is the recommended default for most users.
* ignores ordering differences
* higher compression, but less precise semantics
### Unordered `getAll()` equivalence
### Dominant
`MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS` also uses `getAll()`-level equivalence, but it ignores local ordering differences in addition to absolute frequencies.
* focuses on the most frequent result
* useful when only `get()` is relevant
* may lose secondary candidates
This can yield stronger reduction, but it also weakens the precision of ordered multi-result semantics.
Choose this mode only when the application does not depend on the ordering of alternative results.
### Dominant `get()` equivalence
## Output format
`MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS` focuses on preserving preferred-result semantics for `get()`, subject to dominance thresholds.
The compiled file:
If a node does not satisfy the configured dominance constraints, compilation falls back to ranked `getAll()` semantics for that node to avoid unsafe over-reduction.
* is a binary representation of the trie
* uses **GZip compression**
* is optimized for:
This mode is most suitable when the application primarily consumes the preferred result and does not rely on preserving richer ambiguity information.
* fast loading
* minimal memory footprint
## Recommended usage patterns
Typical properties:
### Use offline preparation
* small file size
* fast deserialization
* no runtime preprocessing required
The CLI is best used as a preparation step during packaging, deployment, or controlled artifact generation. This keeps compilation outside the runtime startup path and allows services to load only the finished binary trie.
### Treat compiled files as versioned assets
A `.radixor.gz` file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds.
### Choose reduction mode deliberately
The ranked `getAll()` mode is the safest default. The unordered and dominant modes should be chosen only when their trade-offs are acceptable for the consuming application.
### Expect memory pressure during preparation, not runtime
Compilation is usually a one-time step and is generally fast. The more important operational consideration is memory usage during preparation, because the dictionary-derived mutable structure exists before reduction compacts it into the final read-only trie. This is especially relevant for very large source dictionaries.
## Example workflow
### 1. Prepare dictionary
### 1. Prepare a dictionary
```
```text
run running runs ran
connect connected connecting
```
### 2. Compile
### 2. Compile it
```bash
java org.egothor.stemmer.Compile \
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original
--input ./data/stemmer.txt \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--store-original
```
### 3. Use in application
### 3. Load it in an application
```java
FrequencyTrie<String> trie =
StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.StemmerPatchTrieLoader;
final FrequencyTrie<String> trie =
StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```
## Exit codes and error handling
The CLI uses three exit outcomes:
## Error handling
- `0` for success,
- `1` for processing failures such as I/O or compilation errors,
- `2` for invalid command-line usage.
The CLI reports:
When argument parsing fails, the CLI prints the error message, prints the usage summary, and exits with usage error status.
* missing input file
* invalid arguments
* I/O failures
* parsing errors
When compilation fails during processing, the CLI prints a `Compilation failed: ...` message to standard error and exits with processing error status.
Typical exit codes:
Examples of failure conditions include:
* `0` success
* non-zero failure
Error details are printed to standard error.
## Performance considerations
### Compilation
* typically CPU-bound
* depends on dictionary size and reduction mode
### Output size
* depends on:
* dictionary completeness
* reduction strategy
* can vary significantly between modes
### Runtime impact
* compiled tries are optimized for:
* fast lookup
* low allocation
* predictable latency
## Best practices
### Use offline compilation
* compile dictionaries during build or deployment
* do not compile on application startup
### Version your artifacts
* treat `.radixor.gz` files as versioned assets
* store them alongside application releases
### Choose reduction mode deliberately
* use **ranked** for correctness
* use **dominant** only if you fully understand the trade-offs
### Keep dictionaries clean
* better input → better compiled output
* avoid noise and inconsistencies
## Integration tips
* store compiled files under `resources/` or a dedicated directory
* load them once and reuse the trie instance
* avoid repeated loading in frequently executed code paths (for example, per-request processing)
- missing required arguments,
- unknown arguments,
- invalid integer values for dominant thresholds,
- missing input files,
- unreadable input,
- existing output file without `--overwrite`,
- general I/O failures during reading or writing.
## Relation to programmatic usage
The CLI and the programmatic API implement the same conceptual preparation step. The CLI is the operationally convenient choice when you want a ready-made binary artifact. The programmatic API is the better fit when compilation must be integrated directly into custom Java workflows.
## Next steps
* [Dictionary format](dictionary-format.md)
* [Programmatic usage](programmatic-usage.md)
* [Quick start](quick-start.md)
## Summary
The `Compile` CLI is the bridge between:
* human-readable dictionary data
* optimized runtime stemmer tables
It enables a clean separation between:
* data preparation
* runtime execution
and is the preferred way to prepare Radixor for production use.
- [Dictionary format](dictionary-format.md)
- [Quick start](quick-start.md)
- [Programmatic usage](programmatic-usage.md)
- [Architecture and reduction](architecture-and-reduction.md)