Practical improvements

fix: cli-compilation doc is missing some params
chore: ExperimentCli is not relevant for JaCoCo
feat: human-readable format of trie metadata
fix: some new JUnit-s added
This commit is contained in:
2026-04-23 23:43:25 +02:00
parent 8785f2b7cb
commit 041b7f43fb
9 changed files with 348 additions and 60 deletions

View File

@@ -24,6 +24,7 @@ java org.egothor.stemmer.Compile \
--input ./data/stemmer.tsv \
--output ./build/english.radixor.gz \
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
--case-processing-mode LOWERCASE_WITH_LOCALE_ROOT \
--store-original \
--overwrite
```
@@ -37,6 +38,8 @@ The CLI supports the following arguments:
--output <file>
--reduction-mode <mode>
[--store-original]
[--right-to-left]
[--case-processing-mode <mode>]
[--dominant-winner-min-percent <1..100>]
[--dominant-winner-over-second-ratio <1..n>]
[--overwrite]
@@ -95,6 +98,31 @@ When this flag is present, the canonical stem itself is inserted using the no-op
This is usually a sensible default for real dictionaries because it ensures that canonical forms are directly representable in the compiled trie rather than relying only on their variants.
### `--right-to-left`
When present, compilation uses forward traversal (`WordTraversalDirection.FORWARD`) so stored forms are processed from their logical beginning.
```text
--right-to-left
```
This option is intended for right-to-left languages where affix behavior should operate on the written form without externally reversing words.
### `--case-processing-mode <mode>`
Controls dictionary key normalization during compilation and lookup.
Supported values are:
- `LOWERCASE_WITH_LOCALE_ROOT` (default)
- `AS_IS`
Example:
```text
--case-processing-mode AS_IS
```
### `--dominant-winner-min-percent <1..100>`
Sets the minimum winner percentage used by dominant-result reduction settings.
@@ -177,6 +205,8 @@ The CLI is best used as a preparation step during packaging, deployment, or cont
A `.radixor.gz` file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds.
Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes traversal direction, RTL indicator, reduction mode, case-processing mode, and dominant thresholds. After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains.
### Choose reduction mode deliberately
The ranked `getAll()` mode is the safest default. The unordered and dominant modes should be chosen only when their trade-offs are acceptable for the consuming application.