CLI Compilation

Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.

This is the recommended workflow for deployment environments, as it separates:

  • dictionary preparation (offline)
  • stemming execution (runtime)

Overview

The Compile tool:

  1. reads a line-oriented dictionary file
  2. converts word-stem pairs into patch commands
  3. builds a trie structure
  4. applies subtree reduction
  5. writes a compressed binary artifact

The output is a .radixor.gz file suitable for fast runtime loading.
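Step 2 can be pictured with a deliberately simplified suffix patch: "remove N trailing characters, then append a suffix". The sketch below is an illustration only. The `SuffixPatch` class and its fields are hypothetical and do not reproduce Radixor's actual patch-command encoding.

```java
/** Hypothetical simplified patch: drop `cut` trailing characters, then append `suffix`. */
final class SuffixPatch {
    final int cut;
    final String suffix;

    SuffixPatch(int cut, String suffix) {
        this.cut = cut;
        this.suffix = suffix;
    }

    /** Derives the patch that turns `word` into `stem` via their common prefix. */
    static SuffixPatch between(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return new SuffixPatch(word.length() - common, stem.substring(common));
    }

    /** Applies the patch; the word must be at least `cut` characters long. */
    String applyTo(String word) {
        return word.substring(0, word.length() - cut) + suffix;
    }
}
```

Under this simplified model, "connected" to "connect" and "walked" to "walk" yield the same patch (cut 2, empty suffix). Many words sharing few distinct patches is what makes storing patches in a trie, rather than full stems, compress well.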

Basic usage

java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original \
  --overwrite

Required arguments

--input

Path to the source dictionary file.

--input ./data/stemmer.txt

--output

Path to the output binary file.

  • parent directories are created automatically
  • output is written as GZip-compressed binary

--output ./build/english.radixor.gz
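The two output guarantees (parent directories created, GZip-compressed payload) can be reproduced with the Java standard library alone. This is a minimal sketch of that behavior, not the tool's own code. `GzipArtifactWriter` is a hypothetical name and the payload is arbitrary bytes, not Radixor's binary format.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper showing "create parent directories, then write GZip". */
final class GzipArtifactWriter {

    static void write(Path output, byte[] payload) throws IOException {
        // Create missing parent directories; a no-op when they already exist.
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Closing the GZIP stream writes the trailer and closes the file.
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(output))) {
            out.write(payload);
        }
    }
}
```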

Optional arguments

--reduction-mode

Controls how aggressively the trie is reduced during compilation.

Available values:

  • MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
  • MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
  • MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS

Example:

--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS

Recommendation

Use:

MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS

This provides:

  • safe behavior
  • deterministic ordering
  • good compression

--store-original

Stores each stem as a mapping to itself, that is, as a no-op patch.

--store-original

Effect:

  • ensures that canonical forms are always resolvable
  • improves robustness in real-world inputs

Recommended for most use cases.
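The effect of the flag can be sketched on a single dictionary line. The code below assumes that the first token on a line is the stem and the remaining tokens are its variants, as the dictionary example in this document suggests. `DictionaryLineSketch` is a hypothetical helper, not part of Radixor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of what --store-original adds to the mappings. */
final class DictionaryLineSketch {

    /** Maps each word on the line to its stem; assumes the first token is the stem. */
    static Map<String, String> pairs(String line, boolean storeOriginal) {
        Map<String, String> wordToStem = new LinkedHashMap<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return wordToStem;
        }
        String[] tokens = trimmed.split("\\s+");
        String stem = tokens[0];
        if (storeOriginal) {
            // The no-op mapping: the canonical form resolves to itself.
            wordToStem.put(stem, stem);
        }
        for (int i = 1; i < tokens.length; i++) {
            wordToStem.put(tokens[i], stem);
        }
        return wordToStem;
    }
}
```

With `storeOriginal` set to `false`, the line `run running runs ran` produces no entry for `run` itself, so input that is already in canonical form has no exact mapping.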

--overwrite

Allows overwriting an existing output file.

--overwrite

Without this flag:

  • compilation fails if the output file already exists
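A fail-if-exists guard of this kind is commonly built on `StandardOpenOption.CREATE_NEW`. The sketch below shows that approach with a hypothetical `OverwriteGuard` helper. Whether the Compile tool implements the check this way is an assumption.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper mirroring the documented --overwrite semantics. */
final class OverwriteGuard {

    static OutputStream open(Path output, boolean overwrite) throws IOException {
        if (overwrite) {
            // Replace any existing content.
            return Files.newOutputStream(output,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING,
                    StandardOpenOption.WRITE);
        }
        // CREATE_NEW throws FileAlreadyExistsException if the file is present.
        return Files.newOutputStream(output,
                StandardOpenOption.CREATE_NEW,
                StandardOpenOption.WRITE);
    }
}
```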

Reduction strategy explained

Reduction merges semantically equivalent subtrees to reduce memory and file size.

Trade-offs:

Mode       Compression   Behavioral fidelity
Ranked     Medium        High
Unordered  High          Medium
Dominant   Highest       Lower (heuristic)

Ranked

  • preserves full getAll() ordering
  • safest and most predictable

Unordered

  • ignores ordering differences
  • higher compression, but less precise semantics

Dominant

  • focuses on the most frequent result
  • useful when only get() is relevant
  • may lose secondary candidates
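One way to read the three modes is as three different equality tests on the ranked result list of a node. The sketch below applies each test to two plain lists. It is a simplification under that assumption: `EquivalenceSketch` is a hypothetical class, real reduction compares whole subtrees, and Radixor's exact criteria may differ.

```java
import java.util.HashSet;
import java.util.List;

/** Hypothetical comparison of two result lists under each reduction mode. */
final class EquivalenceSketch {

    /** Ranked: same candidates in the same order. */
    static boolean ranked(List<String> a, List<String> b) {
        return a.equals(b);
    }

    /** Unordered: same candidates, order ignored. */
    static boolean unordered(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b));
    }

    /** Dominant: only the top-ranked candidate has to match. */
    static boolean dominant(List<String> a, List<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty();
        }
        return a.get(0).equals(b.get(0));
    }
}
```

Each looser test treats more nodes as equal, which is why compression rises from ranked to dominant while the information preserved for getAll() shrinks.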

Output format

The compiled file:

  • is a binary representation of the trie
  • uses GZip compression
  • is optimized for:

    • fast loading
    • minimal memory footprint

Typical properties:

  • small file size
  • fast deserialization
  • no runtime preprocessing required
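Because the artifact is GZip-compressed, a build pipeline can run a cheap sanity check before shipping it: every GZip stream starts with the two magic bytes 0x1f 0x8b. The sketch below performs only that check. `GzipCheck` is a hypothetical helper, and it does not validate the Radixor payload inside the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-deployment check for the GZip magic number. */
final class GzipCheck {

    static boolean looksLikeGzip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();
            return first == 0x1f && second == 0x8b;
        }
    }
}
```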

Example workflow

1. Prepare dictionary

run running runs ran
connect connected connecting

2. Compile

java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original

3. Use in application

FrequencyTrie<String> trie =
    StemmerPatchTrieLoader.loadBinary("english.radixor.gz");

Error handling

The CLI reports:

  • missing input file
  • invalid arguments
  • I/O failures
  • parsing errors

Typical exit codes:

  • 0: success
  • non-zero: failure

Error details are printed to standard error.
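A build step that drives the CLI from Java can rely on exactly this contract. The sketch below launches the documented command with `ProcessBuilder`, forwards standard error, and treats any non-zero exit code as a failure. `CompileRunner` is a hypothetical wrapper, and it assumes that `java` is on the PATH and that Radixor is on the JVM classpath (for example via the `CLASSPATH` environment variable).

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical wrapper that runs the Compile CLI and checks its exit code. */
final class CompileRunner {

    /** Builds the documented invocation; assumes Radixor is on the classpath. */
    static List<String> command(String input, String output) {
        return List.of(
                "java", "org.egothor.stemmer.Compile",
                "--input", input,
                "--output", output,
                "--overwrite");
    }

    /** Runs the command, forwarding stdout and stderr, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    /** Turns a non-zero exit code into an exception so the build stops early. */
    static void requireSuccess(int exitCode) {
        if (exitCode != 0) {
            throw new IllegalStateException("Compile failed with exit code " + exitCode);
        }
    }
}
```

A typical call chain would be `requireSuccess(run(command(in, out)))`. The error details themselves stay on standard error, where `inheritIO()` makes them visible in the build log.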

Performance considerations

Compilation

  • typically CPU-bound
  • depends on dictionary size and reduction mode

Output size

  • depends on:

    • dictionary completeness
    • reduction strategy
  • can vary significantly between modes

Runtime impact

  • compiled tries are optimized for:

    • fast lookup
    • low allocation
    • predictable latency

Best practices

Use offline compilation

  • compile dictionaries during build or deployment
  • do not compile on application startup

Version your artifacts

  • treat .radixor.gz files as versioned assets
  • store them alongside application releases

Choose reduction mode deliberately

  • use ranked for correctness
  • use dominant only if you fully understand the trade-offs

Keep dictionaries clean

  • better input → better compiled output
  • avoid noise and inconsistencies

Integration tips

  • store compiled files under resources/ or a dedicated directory
  • load them once and reuse the trie instance
  • avoid repeated loading in frequently executed code paths (for example, per-request processing)
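The "load once and reuse" advice can be enforced with a small thread-safe holder. The sketch below is generic so that it stays independent of Radixor's types. `LoadOnce` is a hypothetical class, and the loader passed to it would wrap the `StemmerPatchTrieLoader.loadBinary(...)` call shown earlier, including whatever exception handling that call needs.

```java
import java.util.function.Supplier;

/** Hypothetical holder that runs an expensive loader at most once. */
final class LoadOnce<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private volatile T value;

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    // First caller pays the loading cost; later callers reuse the value.
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

The holder treats `null` as "not loaded yet", so the loader should never return `null`. Keeping one `LoadOnce` instance in application scope keeps trie loading off the per-request path.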

Summary

The Compile CLI is the bridge between:

  • human-readable dictionary data
  • optimized runtime stemmer tables

It enables a clean separation between:

  • data preparation
  • runtime execution

and is the preferred way to prepare Radixor for production use.