Radixor/docs/cli-compilation.md

# CLI Compilation

Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.

This is the recommended workflow for deployment environments, as it separates:

- dictionary preparation (offline)
- stemming execution (runtime)


## Overview

The `Compile` tool:

1. reads a line-oriented dictionary file
2. converts word–stem pairs into patch commands
3. builds a trie structure
4. applies subtree reduction
5. writes a compressed binary artifact

The output is a `.radixor.gz` file suitable for fast runtime loading.


## Basic usage

```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original \
  --overwrite
```


## Required arguments

### `--input`

Path to the source dictionary file.

* must be in the [dictionary format](dictionary-format.md)
* must be readable
* UTF-8 encoding is expected

```
--input ./data/stemmer.txt
```

### `--output`

Path to the output binary file.

* parent directories are created automatically
* output is written as **GZip-compressed binary**

```
--output ./build/english.radixor.gz
```


## Optional arguments

### `--reduction-mode`

Controls how aggressively the trie is reduced during compilation.

Available values:

* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`

Example:

```
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

#### Recommendation

Use:

```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

This provides:

* safe behavior
* deterministic ordering
* good compression


### `--store-original`

Stores the stem itself as a no-op mapping.

```
--store-original
```

Effect:

* ensures that canonical forms are always resolvable
* improves robustness in real-world inputs

Recommended for most use cases.


### `--overwrite`

Allows overwriting an existing output file.

```
--overwrite
```

Without this flag:

* compilation fails if the output file already exists


## Reduction strategy explained

Reduction merges semantically equivalent subtrees to reduce memory and file size.

Trade-offs:

| Mode      | Compression | Behavioral fidelity |
| --------- | ----------- | ------------------- |
| Ranked    | Medium      | High                |
| Unordered | High        | Medium              |
| Dominant  | Highest     | Lower (heuristic)   |

### Ranked (recommended)

* preserves full `getAll()` ordering
* safest and most predictable

### Unordered

* ignores ordering differences
* higher compression, but less precise semantics

### Dominant

* focuses on the most frequent result
* useful when only `get()` is relevant
* may lose secondary candidates


## Output format

The compiled file:

* is a binary representation of the trie
* uses **GZip compression**
* is optimized for:

  * fast loading
  * minimal memory footprint

Typical properties:

* small file size
* fast deserialization
* no runtime preprocessing required


## Example workflow

### 1. Prepare dictionary

```
run running runs ran
connect connected connecting
```

### 2. Compile

```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original
```

### 3. Use in application

```java
FrequencyTrie<String> trie =
    StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```


## Error handling

The CLI reports:

* missing input file
* invalid arguments
* I/O failures
* parsing errors

Typical exit codes:

* `0` – success
* non-zero – failure

Error details are printed to standard error.


## Performance considerations

### Compilation

* typically CPU-bound
* depends on dictionary size and reduction mode

### Output size

* depends on:

  * dictionary completeness
  * reduction strategy
* can vary significantly between modes

### Runtime impact

* compiled tries are optimized for:

  * fast lookup
  * low allocation
  * predictable latency


## Best practices

### Use offline compilation

* compile dictionaries during build or deployment
* do not compile on application startup

### Version your artifacts

* treat `.radixor.gz` files as versioned assets
* store them alongside application releases

### Choose reduction mode deliberately

* use **ranked** for correctness
* use **dominant** only if you fully understand the trade-offs

### Keep dictionaries clean

* better input → better compiled output
* avoid noise and inconsistencies


## Integration tips

* store compiled files under `resources/` or a dedicated directory
* load them once and reuse the trie instance
* avoid repeated loading in frequently executed code paths (for example, per-request processing)


## Next steps

* [Dictionary format](dictionary-format.md)
* [Programmatic usage](programmatic-usage.md)
* [Quick start](quick-start.md)


## Summary

The `Compile` CLI is the bridge between:

* human-readable dictionary data
* optimized runtime stemmer tables

It enables a clean separation between:

* data preparation
* runtime execution

and is the preferred way to prepare Radixor for production use.