Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words
fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs
refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure
test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics
docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance
chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
305
docs/cli-compilation.md
Normal file
@@ -0,0 +1,305 @@

# CLI Compilation

> ← Back to [README.md](../README.md)

Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.

This is the recommended workflow for deployment environments, as it separates:

- dictionary preparation (offline)
- stemming execution (runtime)

## Overview

The `Compile` tool:

1. reads a line-oriented dictionary file
2. converts word–stem pairs into patch commands
3. builds a trie structure
4. applies subtree reduction
5. writes a compressed binary artifact

The output is a `.radixor.gz` file suitable for fast runtime loading.
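
Step 2 can be illustrated with a deliberately simplified patch model. The sketch below is not Radixor's actual patch-command encoding: the `SuffixPatch` class and its `diff` and `apply` methods are hypothetical names, and the model only removes characters from the end of a word and appends a replacement suffix.

```java
/** Hypothetical, simplified patch: drop N trailing characters, then append a suffix. */
public final class SuffixPatch {
    public final int delete;     // number of characters to remove from the end of the word
    public final String append;  // text appended after the removal

    public SuffixPatch(int delete, String append) {
        this.delete = delete;
        this.append = append;
    }

    /** Derives the patch that rewrites word into stem, based on their common prefix. */
    public static SuffixPatch diff(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return new SuffixPatch(word.length() - common, stem.substring(common));
    }

    /** Applies the patch to a word that is at least `delete` characters long. */
    public String apply(String word) {
        return word.substring(0, word.length() - delete) + append;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"running", "runs", "ran", "run"}) {
            SuffixPatch patch = diff(word, "run");
            System.out.println(word + " -> " + patch.apply(word)
                    + " (delete=" + patch.delete + ", append=\"" + patch.append + "\")");
        }
    }
}
```

In this simplified model, a word that already equals its stem yields an empty patch (`delete=0`, empty `append`), and unrelated words such as `connected` and `walked` end up with the same patch (`delete=2`).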

## Basic usage

```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original \
  --overwrite
```

## Required arguments

### `--input`

Path to the source dictionary file.

* must be in the [dictionary format](dictionary-format.md)
* must be readable
* UTF-8 encoding is expected

```
--input ./data/stemmer.txt
```

### `--output`

Path to the output binary file.

* parent directories are created automatically
* output is written as **GZip-compressed binary**

```
--output ./build/english.radixor.gz
```
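
Because the output is GZip-compressed, a build script can run a cheap sanity check on the artifact before shipping it. The following is a minimal sketch that only inspects the two-byte GZip magic number using the standard `java.util.zip` constants; the `GzipCheck` class name is an assumption for illustration, and the check says nothing about whether the payload is a valid Radixor trie.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public final class GzipCheck {

    /** Returns true when the first two bytes are the GZip magic number 0x1f 0x8b. */
    public static boolean hasGzipMagic(byte[] header) {
        if (header.length < 2) {
            return false;
        }
        // GZIP_MAGIC is stored little-endian: low byte first.
        int magic = (header[0] & 0xff) | ((header[1] & 0xff) << 8);
        return magic == GZIPInputStream.GZIP_MAGIC;
    }

    /** Reads the first two bytes of a file and checks them. */
    public static boolean isGzipFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasGzipMagic(in.readNBytes(2));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GzipCheck ./build/english.radixor.gz
        System.out.println(isGzipFile(Path.of(args[0])));
    }
}
```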

## Optional arguments

### `--reduction-mode`

Controls how aggressively the trie is reduced during compilation.

Available values:

* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`

Example:

```
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

#### Recommendation

Use:

```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

This provides:

* safe behavior
* deterministic ordering
* good compression

### `--store-original`

Stores each stem as an entry that maps to itself (a no-op mapping), so that a word already in canonical form can still be looked up.

```
--store-original
```

Effect:

* ensures that canonical forms are always resolvable
* improves robustness in real-world inputs

Recommended for most use cases.
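
The effect of the flag can be pictured with a small sketch. It assumes that the first token on a dictionary line is the stem and the remaining tokens are its variants (the dictionary format page is the authoritative reference). The `expand` helper is hypothetical, and it produces plain word-to-stem pairs rather than Radixor's internal patch entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class StoreOriginalSketch {

    /** Expands one dictionary line into word -> stem pairs. */
    public static Map<String, String> expand(String line, boolean storeOriginal) {
        String[] tokens = line.trim().split("\\s+");
        String stem = tokens[0];            // assumption: the first token is the stem
        Map<String, String> pairs = new LinkedHashMap<>();
        if (storeOriginal) {
            pairs.put(stem, stem);          // the no-op mapping: stem -> stem
        }
        for (int i = 1; i < tokens.length; i++) {
            pairs.put(tokens[i], stem);     // every variant maps to the stem
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints {running=run, runs=run, ran=run}
        System.out.println(expand("run running runs ran", false));
        // prints {run=run, running=run, runs=run, ran=run}
        System.out.println(expand("run running runs ran", true));
    }
}
```

Without the extra pair, a lookup for `run` itself has no entry of its own in this model, which is the gap the flag is meant to close.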

### `--overwrite`

Allows overwriting an existing output file.

```
--overwrite
```

Without this flag:

* compilation fails if the output file already exists
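
The same refuse-unless-asked behavior is easy to reproduce in surrounding build tooling. The sketch below shows one way to express it with standard `java.nio.file` open options; it is not taken from the `Compile` source, and the `OverwriteSketch` class and `open` method are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class OverwriteSketch {

    /** Opens the output file, refusing to replace an existing file unless overwrite is set. */
    public static OutputStream open(Path output, boolean overwrite) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);   // create missing parent directories first
        }
        if (overwrite) {
            return Files.newOutputStream(output,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        }
        // CREATE_NEW throws FileAlreadyExistsException when the file is already present.
        return Files.newOutputStream(output,
                StandardOpenOption.CREATE_NEW,
                StandardOpenOption.WRITE);
    }
}
```

Using `CREATE_NEW` makes the existence check and the file creation a single atomic step, which avoids a race between checking for the file and writing it.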

## Reduction strategy explained

Reduction merges semantically equivalent subtrees to reduce memory and file size.

Trade-offs:

| Mode      | Compression | Behavioral fidelity |
| --------- | ----------- | ------------------- |
| Ranked    | Medium      | High                |
| Unordered | High        | Medium              |
| Dominant  | Highest     | Lower (heuristic)   |

### Ranked (recommended)

* preserves full `getAll()` ordering
* safest and most predictable

### Unordered

* ignores ordering differences
* higher compression, but less precise semantics

### Dominant

* focuses on the most frequent result
* useful when only `get()` is relevant
* may lose secondary candidates
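
The three modes differ in when two subtrees count as equivalent. The sketch below compares two hypothetical result lists under each criterion as described in the bullets above. It assumes results are ordered from most to least frequent, and it only illustrates the comparison; it is not Radixor's reduction code, and all names in it are made up for the example.

```java
import java.util.HashSet;
import java.util.List;

public final class EquivalenceSketch {

    /** Ranked: the same candidates in the same order. */
    public static boolean rankedEqual(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /** Unordered: the same candidates, order ignored. */
    public static boolean unorderedEqual(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    /** Dominant: only the top candidate has to match. */
    public static boolean dominantEqual(List<String> a, List<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty();
        }
        return a.get(0).equals(b.get(0));
    }

    public static void main(String[] args) {
        List<String> x = List.of("patchA", "patchB");
        List<String> y = List.of("patchB", "patchA");   // same candidates, different order
        List<String> z = List.of("patchA", "patchC");   // same top candidate only

        System.out.println("ranked(x, y)    = " + rankedEqual(x, y));
        System.out.println("unordered(x, y) = " + unorderedEqual(x, y));
        System.out.println("unordered(x, z) = " + unorderedEqual(x, z));
        System.out.println("dominant(x, z)  = " + dominantEqual(x, z));
    }
}
```

Each looser criterion accepts more pairs as equivalent, which is why compression rises from ranked to dominant while fidelity to the original `getAll()` results falls.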

## Output format

The compiled file:

* is a binary representation of the trie
* uses **GZip compression**
* is optimized for:
  * fast loading
  * minimal memory footprint

Typical properties:

* small file size
* fast deserialization
* no runtime preprocessing required

## Example workflow

### 1. Prepare dictionary

```
run running runs ran
connect connected connecting
```

### 2. Compile

```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original
```

### 3. Use in application

```java
FrequencyTrie<String> trie =
    StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```

## Error handling

The CLI reports:

* missing input file
* invalid arguments
* I/O failures
* parsing errors

Typical exit codes:

* `0` – success
* non-zero – failure

Error details are printed to standard error.
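
A build step that wraps the CLI can rely on this exit-code contract. The following is a minimal sketch that launches the tool with `ProcessBuilder` and fails on any non-zero exit code. It uses only the flags documented on this page; the `CompileRunner` class and the classpath value passed to it are assumptions that depend on how Radixor is packaged in your build.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public final class CompileRunner {

    /** Builds the command line from the documented flags. */
    public static List<String> command(String classpath, String input, String output,
                                       String reductionMode, boolean storeOriginal,
                                       boolean overwrite) {
        List<String> cmd = new ArrayList<>(List.of(
                "java", "-cp", classpath, "org.egothor.stemmer.Compile",
                "--input", input,
                "--output", output,
                "--reduction-mode", reductionMode));
        if (storeOriginal) {
            cmd.add("--store-original");
        }
        if (overwrite) {
            cmd.add("--overwrite");
        }
        return cmd;
    }

    /** Runs the compiler and throws when it exits with a non-zero code. */
    public static void run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .inheritIO()   // forward the tool's standard error output to this process
                .start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IllegalStateException("Compile failed with exit code " + exit);
        }
    }
}
```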

## Performance considerations

### Compilation

* typically CPU-bound
* depends on dictionary size and reduction mode

### Output size

* depends on:
  * dictionary completeness
  * reduction strategy
* can vary significantly between modes

### Runtime impact

* compiled tries are optimized for:
  * fast lookup
  * low allocation
  * predictable latency

## Best practices

### Use offline compilation

* compile dictionaries during build or deployment
* do not compile on application startup

### Version your artifacts

* treat `.radixor.gz` files as versioned assets
* store them alongside application releases

### Choose reduction mode deliberately

* use **ranked** for correctness
* use **dominant** only if you fully understand the trade-offs

### Keep dictionaries clean

* better input → better compiled output
* avoid noise and inconsistencies

## Integration tips

* store compiled files under `resources/` or a dedicated directory
* load them once and reuse the trie instance
* avoid repeated loading in frequently executed code paths (for example, per-request processing)
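
One way to follow the load-once advice is a small thread-safe holder that invokes the loader on first use and caches the result. The `LoadOnce` class below is a hypothetical helper, not part of Radixor; it is generic so the sketch stays self-contained and does not assume anything about the loader's signature.

```java
import java.util.function.Supplier;

/** Hypothetical helper: calls the loader on first access and caches the result. */
public final class LoadOnce<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private volatile T value;   // volatile makes the double-checked locking below safe

    public LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = loader.get();   // assumed to return a non-null object
                    value = result;
                }
            }
        }
        return result;
    }
}
```

In an application, the supplier would wrap the `StemmerPatchTrieLoader.loadBinary(...)` call shown in the example workflow, handling whatever exceptions that call declares, and request handlers would share the single `LoadOnce` instance.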

## Next steps

* [Dictionary format](dictionary-format.md)
* [Programmatic usage](programmatic-usage.md)
* [Quick start](quick-start.md)

## Summary

The `Compile` CLI is the bridge between:

* human-readable dictionary data
* optimized runtime stemmer tables

It enables a clean separation between:

* data preparation
* runtime execution

and is the preferred way to prepare Radixor for production use.