Refine stemmer core, compiled trie workflow, tests, and public documentation
feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words
fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs
refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure
test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics
docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance
chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
470
docs/architecture-and-reduction.md
Normal file
@@ -0,0 +1,470 @@
# Architecture and Reduction

> ← Back to [README.md](../README.md)

This document describes the internal architecture of **Radixor** and the principles behind its **trie compilation and reduction model**.

It explains:

- how data flows from dictionary input to compiled trie
- how patch-command tries are structured
- how subtree reduction works
- how reduction modes affect behavior and size

## Overview

Radixor transforms dictionary data into an optimized runtime structure through three stages:

1. **Mutable construction**
2. **Reduction (canonicalization)**
3. **Compilation (freezing)**

```
Dictionary → Mutable trie → Reduced trie → Compiled trie
```

Each stage has a distinct purpose:

| Stage       | Purpose                     | Structure      |
|-------------|-----------------------------|----------------|
| Build       | Collect mappings            | `MutableNode`  |
| Reduction   | Merge equivalent subtrees   | `ReducedNode`  |
| Compilation | Optimize for runtime lookup | `CompiledNode` |

## Core data model

### Patch-command trie

Radixor stores **patch commands** instead of stems directly.

- keys: word forms
- values: transformation commands
- structure: trie (prefix tree)

At runtime:

1. the word is traversed through the trie
2. a patch command is retrieved
3. the patch is applied to reconstruct the stem
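
The three runtime steps can be illustrated with a minimal sketch. It assumes a deliberately simplified, hypothetical patch notation (`<count>:<suffix>`, meaning "drop `count` trailing characters, then append `suffix`"), which is not Radixor's actual patch-command encoding, and it uses a plain `Map` as a stand-in for the trie. The class and method names are illustrative only.

```java
import java.util.Map;

final class PatchLookupSketch {

    // Hypothetical notation "<n>:<suffix>": remove n trailing characters, then append suffix.
    static String applyPatch(String word, String patch) {
        int separator = patch.indexOf(':');
        int removeCount = Integer.parseInt(patch.substring(0, separator));
        String suffix = patch.substring(separator + 1);
        return word.substring(0, word.length() - removeCount) + suffix;
    }

    public static void main(String[] args) {
        // A plain map stands in for the trie: word form -> patch command.
        Map<String, String> patches = Map.of("running", "4:", "ran", "2:un");
        for (Map.Entry<String, String> entry : patches.entrySet()) {
            String word = entry.getKey();
            System.out.println(word + " -> " + applyPatch(word, entry.getValue()));
        }
    }
}
```

In this toy notation, many different word forms can share the same patch value (any word that loses a four-character ending uses `4:`), which hints at why storing transformations rather than stems leaves more equivalent subtrees to merge later.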

## Stage 1: Mutable construction

The builder (`FrequencyTrie.Builder`) constructs a trie using:

- `MutableNode`
- maps of children (`char → node`)
- maps of value counts (`value → frequency`)

Characteristics:

- insertion-order preserving
- mutable
- optimized for building, not querying

Example structure:

```
g
└─ n
   └─ i
      └─ n
         └─ n
            └─ u
               └─ r
                  └─ (values: {
                       "<patch-command-1>": 3,
                       "<patch-command-2>": 1
                     })
```

This example represents the word "running", stored in reversed form.

- each edge corresponds to one character of the word
- the path is traversed from the end of the word toward the beginning
- the terminal node stores one or more patch commands together with their local frequencies

The values represent transformations from the word form to candidate stems, and the counts indicate how often each mapping was observed during construction.

Note: Radixor stores word forms in reversed order so that suffix-based transformations can be matched efficiently in a trie.
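
The reversed insertion described above can be sketched as follows. This is a simplified stand-in, not the actual `MutableNode` or `FrequencyTrie.Builder` code: the class name, the `put` and `counts` methods, and the use of `LinkedHashMap` (chosen here to preserve insertion order) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReversedTrieSketch {

    // Simplified stand-in for a mutable build-time node.
    static final class Node {
        final Map<Character, Node> children = new LinkedHashMap<>();
        final Map<String, Integer> valueCounts = new LinkedHashMap<>();
    }

    final Node root = new Node();

    // Walks the word from its last character to its first and counts the value at the end node.
    void put(String word, String value) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.valueCounts.merge(value, 1, Integer::sum);
    }

    // Returns the counts stored for the word, or an empty map if no such path exists.
    Map<String, Integer> counts(String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0 && node != null; i--) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? Map.of() : node.valueCounts;
    }

    public static void main(String[] args) {
        ReversedTrieSketch trie = new ReversedTrieSketch();
        trie.put("running", "<patch-command-1>");
        trie.put("running", "<patch-command-1>");
        trie.put("running", "<patch-command-2>");
        System.out.println(trie.counts("running"));
    }
}
```

Because insertion starts at the last character, words that share an ending (for example "running" and "sing") share the first edges of their paths.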

## Local value summary

Before reduction, each node is summarized using `LocalValueSummary`.

It computes:

- ordered values (by frequency)
- aligned counts
- total frequency
- dominant value (if any)
- second-best value

This summary is critical for:

- deterministic ordering
- reduction decisions
- dominance evaluation

## Stage 2: Reduction (canonicalization)

Reduction is the process of merging **semantically equivalent subtrees**.

### Why reduction exists

Without reduction:

- trie size grows linearly with input data
- repeated patterns are duplicated

With reduction:

- identical subtrees are shared
- memory footprint is reduced
- binary output becomes smaller

## Reduction signature

Each subtree is represented by a **ReductionSignature**.

A signature consists of:

1. **local descriptor** (node semantics)
2. **child descriptors** (structure)

```
Signature = (LocalDescriptor, SortedChildDescriptors)
```

Two subtrees are merged if their signatures are equal.

## Local descriptors

The local descriptor encodes how values at a node are interpreted.

Radixor supports three descriptor types:

### 1. Ranked descriptor

Preserves:

- full ordering of values (`getAll()`)

Uses:

- ordered value list

Best for:

- correctness
- deterministic multi-result behavior

### 2. Unordered descriptor

Preserves:

- only membership (set of values)

Ignores:

- ordering differences

Best for:

- higher compression
- use cases where ordering is irrelevant
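
The difference between the ranked and unordered descriptors comes down to which notion of equality is used when comparing two nodes. The sketch below illustrates this with plain `List` and `Set` values; the class and method names are hypothetical and do not correspond to Radixor's descriptor classes.

```java
import java.util.List;
import java.util.Set;

final class DescriptorEqualitySketch {

    // Ranked view: the ordered list itself, so order matters for equality.
    static Object rankedDescriptor(List<String> orderedValues) {
        return List.copyOf(orderedValues);
    }

    // Unordered view: only membership, so order is irrelevant for equality.
    static Object unorderedDescriptor(List<String> orderedValues) {
        return Set.copyOf(orderedValues);
    }

    public static void main(String[] args) {
        List<String> first = List.of("patch-a", "patch-b");
        List<String> second = List.of("patch-b", "patch-a");
        // Same values, different ranking: not mergeable under the ranked view.
        System.out.println(rankedDescriptor(first).equals(rankedDescriptor(second)));
        // Same membership: mergeable under the unordered view.
        System.out.println(unorderedDescriptor(first).equals(unorderedDescriptor(second)));
    }
}
```

Two nodes that hold the same values in a different order compare unequal under the ranked view and equal under the unordered view, which is why the unordered descriptor allows more subtrees to be merged.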

### 3. Dominant descriptor

Preserves:

- only the dominant value (`get()`)

Condition:

- dominant value must satisfy thresholds:
  - minimum percentage
  - ratio over second-best

Fallback:

- if dominance is not strong enough → ranked descriptor is used

Best for:

- maximum compression
- single-result workflows
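
A dominance check with the two thresholds named above might look like the sketch below. The method name, parameter types, and the threshold values used in `main` (75 % share, 3x lead) are assumptions for illustration; the actual defaults and the exact comparison rules in Radixor may differ.

```java
final class DominanceSketch {

    /**
     * Returns true when the top count holds at least minPercent of the total
     * and is at least minRatio times the second-best count.
     * The counts must already be sorted in descending order.
     */
    static boolean isDominant(int[] orderedCounts, int minPercent, double minRatio) {
        if (orderedCounts.length == 0) {
            return false;
        }
        long total = 0;
        for (int count : orderedCounts) {
            total += count;
        }
        int best = orderedCounts[0];
        boolean enoughShare = best * 100L >= (long) minPercent * total;
        boolean enoughLead = orderedCounts.length == 1
                || best >= minRatio * orderedCounts[1];
        return enoughShare && enoughLead;
    }

    public static void main(String[] args) {
        // Hypothetical thresholds: 75 % share and a 3x lead over the runner-up.
        System.out.println(isDominant(new int[] {9, 1}, 75, 3.0));
        System.out.println(isDominant(new int[] {5, 4}, 75, 3.0));
    }
}
```

When the check fails, a node would fall back to the ranked descriptor, as described above, so weakly dominated nodes keep their full ordering.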

## Child descriptors

Each child is represented as:

```
(edge character, child signature)
```

Children are sorted by edge character to ensure:

- deterministic signatures
- stable equality comparisons

## Reduction context

`ReductionContext` maintains:

- mapping: `ReductionSignature → ReducedNode`
- canonical instances of subtrees

Workflow:

1. compute signature
2. check if already exists
3. reuse existing node or create new one

This ensures:

- structural sharing
- no duplicate equivalent subtrees
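
The compute, check, reuse workflow is a form of hash-consing. The sketch below shows the idea under simplifying assumptions: the node's own value-based `equals`/`hashCode` plays the role of the signature, value counts are left out, and all names are illustrative rather than taken from `ReductionContext` or `ReducedNode`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

final class CanonicalizationSketch {

    // Stand-in for a reduced node: local values plus children sorted by edge character.
    static final class Node {
        final List<String> values;
        final TreeMap<Character, Node> children;

        Node(List<String> values, TreeMap<Character, Node> children) {
            this.values = values;
            this.children = children;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Node
                    && values.equals(((Node) other).values)
                    && children.equals(((Node) other).children);
        }

        @Override
        public int hashCode() {
            return Objects.hash(values, children);
        }
    }

    // Signature -> canonical node; here the node itself acts as its signature.
    private final Map<Node, Node> canonical = new HashMap<>();

    // Bottom-up: children are canonicalized before the node itself is looked up.
    Node canonicalize(Node node) {
        TreeMap<Character, Node> children = new TreeMap<>();
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            children.put(entry.getKey(), canonicalize(entry.getValue()));
        }
        Node candidate = new Node(List.copyOf(node.values), children);
        Node existing = canonical.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    public static void main(String[] args) {
        Node leafA = new Node(List.of("patch-a"), new TreeMap<>());
        Node leafB = new Node(List.of("patch-a"), new TreeMap<>());
        TreeMap<Character, Node> edges = new TreeMap<>();
        edges.put('d', leafA);
        edges.put('s', leafB);
        Node reduced = new CanonicalizationSketch().canonicalize(new Node(List.of(), edges));
        // Both edges now point at the same canonical leaf instance.
        System.out.println(reduced.children.get('d') == reduced.children.get('s'));
    }
}
```

Using a `TreeMap` keeps children sorted by edge character, so two structurally equal subtrees always produce the same signature regardless of insertion order.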

## Reduced nodes

`ReducedNode` represents:

- canonical subtree
- aggregated value counts
- canonical children

It supports:

- merging local counts
- verifying structural consistency

At this stage:

- structure is canonical
- still mutable (internally)

## Stage 3: Compilation (freezing)

The reduced trie is converted into a **CompiledNode** structure.

### CompiledNode characteristics

- immutable
- array-based storage
- optimized for fast lookup

Fields:

- `char[] edgeLabels`
- `CompiledNode[] children`
- `V[] orderedValues`
- `int[] orderedCounts`

## Lookup algorithm

Runtime lookup:

1. traverse trie using `edgeLabels` (matching characters from the end of the word toward the beginning)
2. binary search per node
3. retrieve values
4. apply patch command

Properties:

- one node visit per character of the word, each with a binary search over that node's edge labels
- low memory overhead
- minimal memory allocation during lookup; patch application produces the resulting string
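
The traversal can be sketched with a small array-based node. This is an illustration only: the `Node` class mirrors the field names listed for `CompiledNode` but is not the real class, and the `get` method shown here is an assumed shape for the lookup.

```java
import java.util.Arrays;

final class CompiledLookupSketch {

    // Simplified stand-in for a compiled node: sorted edge labels with a parallel child array.
    static final class Node {
        final char[] edgeLabels;
        final Node[] children;
        final String[] orderedValues;

        Node(char[] edgeLabels, Node[] children, String[] orderedValues) {
            this.edgeLabels = edgeLabels;
            this.children = children;
            this.orderedValues = orderedValues;
        }
    }

    // Follows the word from its last character to its first; returns the top value or null.
    static String get(Node root, String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            int index = Arrays.binarySearch(node.edgeLabels, word.charAt(i));
            if (index < 0) {
                return null; // no edge for this character
            }
            node = node.children[index];
        }
        return node.orderedValues.length == 0 ? null : node.orderedValues[0];
    }

    public static void main(String[] args) {
        Node leaf = new Node(new char[0], new Node[0], new String[] {"<patch>"});
        // Path for "ab" stored in reverse: root -b-> mid -a-> leaf.
        Node mid = new Node(new char[] {'a'}, new Node[] {leaf}, new String[0]);
        Node root = new Node(new char[] {'b', 'z'}, new Node[] {mid, leaf}, new String[0]);
        System.out.println(get(root, "ab"));
        System.out.println(get(root, "ac"));
    }
}
```

The loop allocates nothing: it only indexes into arrays that were built once at compile or load time. `Arrays.binarySearch` requires the edge labels to be sorted, which the sketch assumes is guaranteed when the node is built.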

## Deterministic ordering

Value ordering is deterministic and stable:

1. higher frequency first
2. shorter string first
3. lexicographically smaller
4. insertion order

This guarantees:

- reproducible builds
- stable query results
- predictable ranking
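
The four ordering rules map naturally onto a chained `Comparator` followed by a stable sort. The sketch below assumes `String` values and a count map that iterates in insertion order; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ValueOrderingSketch {

    // Orders values by: higher count, shorter text, lexicographic order, then insertion order.
    static List<String> order(Map<String, Integer> countsInInsertionOrder) {
        List<String> values = new ArrayList<>(countsInInsertionOrder.keySet());
        Comparator<String> byRank = Comparator
                .comparingInt((String value) -> countsInInsertionOrder.get(value)).reversed()
                .thenComparingInt(String::length)
                .thenComparing(Comparator.naturalOrder());
        // List.sort is stable, so any values that still compare equal keep their insertion order.
        // With distinct String keys that last rule never fires; it matters for value types
        // whose textual form can coincide.
        values.sort(byRank);
        return values;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("bb", 2);
        counts.put("a", 2);
        counts.put("ccc", 5);
        counts.put("ab", 2);
        System.out.println(order(counts)); // prints [ccc, a, ab, bb]
    }
}
```

In terms of the public API described in this documentation, the first element of such a list corresponds to what `get()` returns and the whole list to what `getAll()` returns.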

## Reduction modes

Reduction modes control how local descriptors are chosen.

### Ranked mode

```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

- preserves full semantics
- safest option
- recommended default

### Unordered mode

```
MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS
```

- ignores ordering
- higher compression
- slightly weaker semantics

### Dominant mode

```
MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS
```

- keeps only dominant result
- highest compression
- may lose alternative candidates

## Trade-offs

| Aspect      | Ranked | Unordered                     | Dominant                                     |
|-------------|--------|-------------------------------|----------------------------------------------|
| Compression | Medium | High                          | Highest                                      |
| Accuracy    | High   | Medium                        | Lower                                        |
| `getAll()`  | Full   | Same set, order not preserved | Dominant value only; alternatives may be lost |
| `get()`     | Exact  | Exact                         | Heuristic                                    |

## Deserialization model

Binary loading uses:

- `NodeData` as intermediate representation
- reconstruction of `CompiledNode`

This separates:

- I/O format
- in-memory structure

## Why this architecture works

Radixor achieves:

### Compactness

- subtree sharing
- efficient encoding
- compressed binary output

### Performance

- array-based lookup
- no runtime reduction
- minimal branching

### Flexibility

- configurable reduction strategies
- multiple result support
- dictionary-driven behavior

### Determinism

- stable ordering
- canonical signatures
- reproducible builds

## Design philosophy

The architecture reflects a few key principles:

- separate build-time complexity from runtime simplicity
- encode semantics explicitly (not implicitly in code)
- favor deterministic behavior over heuristic shortcuts
- allow controlled trade-offs between size and fidelity

## When to tune reduction

You should consider changing reduction mode when:

- binary size is too large
- memory footprint must be minimized
- only single-result stemming is needed

Otherwise:

**use ranked mode by default**

## Next steps

- [Programmatic usage](programmatic-usage.md)
- [CLI compilation](cli-compilation.md)
- [Dictionary format](dictionary-format.md)

## Summary

Radixor’s architecture is built around:

- patch-command tries
- canonical subtree reduction
- immutable compiled structures

This design allows the system to remain:

- fast
- compact
- deterministic
- adaptable

while still supporting advanced use cases such as:

- ambiguity-aware stemming
- dictionary evolution
- controlled trade-offs between size and behavior
252
docs/built-in-languages.md
Normal file
@@ -0,0 +1,252 @@
# Built-in Languages

> ← Back to [README.md](../README.md)

Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.

These built-in resources are useful for:

- quick integration
- testing and evaluation
- reference behavior
- prototyping search pipelines

## Overview

Bundled dictionaries are exposed through:

```java
StemmerPatchTrieLoader.Language
```

They are packaged with the library and loaded from the classpath.

## Supported languages

The following language identifiers are currently available:

| Language   | Enum constant | Description                 |
|------------|---------------|-----------------------------|
| Danish     | `DA_DK`       | Danish                      |
| German     | `DE_DE`       | German                      |
| Spanish    | `ES_ES`       | Spanish                     |
| French     | `FR_FR`       | French                      |
| Italian    | `IT_IT`       | Italian                     |
| Dutch      | `NL_NL`       | Dutch                       |
| Norwegian  | `NO_NO`       | Norwegian                   |
| Portuguese | `PT_PT`       | Portuguese                  |
| Russian    | `RU_RU`       | Russian                     |
| Swedish    | `SV_SE`       | Swedish                     |
| English    | `US_UK`       | Standard English            |
| English    | `US_UK_PROFI` | Extended English dictionary |

## Basic usage

Load a bundled stemmer:

```java
import java.io.IOException;

import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BuiltInExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );
    }
}
```

## Example: stemming with `US_UK_PROFI`

```java
import java.io.IOException;

import org.egothor.stemmer.*;

public final class EnglishExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        // Look up the preferred patch command, then apply it to obtain the stem.
        String word = "running";
        String patch = trie.get(word);
        String stem = PatchCommandEncoder.apply(word, patch);

        System.out.println(word + " -> " + stem);
    }
}
```

## `US_UK` vs `US_UK_PROFI`

### `US_UK`

* smaller dictionary
* faster load time
* suitable for lightweight use cases

### `US_UK_PROFI`

* larger and more complete dataset
* better coverage of word forms
* improved stemming quality
* slightly larger memory footprint

### Recommendation

Use:

```
US_UK_PROFI
```

for most applications unless memory constraints are strict.

## How bundled dictionaries are loaded

Internally:

- dictionaries are stored as text resources
- parsed using `StemmerDictionaryParser`
- compiled into a trie at load time

This means:

- first load includes parsing + compilation cost
- subsequent usage is fast
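
Since the parsing and compilation cost is paid on load, an application typically wants to pay it once and share the result. The sketch below shows one generic way to do that with a memoizing `Supplier`. `LoadOnceSketch` is a hypothetical helper, not part of Radixor, and the lambda in `main` is a stand-in for a real call such as `StemmerPatchTrieLoader.load(...)`. It also assumes the loaded trie is safe to share between threads, which is plausible for an immutable compiled structure but should be verified against the library documentation.

```java
import java.util.function.Supplier;

final class LoadOnceSketch<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value;

    LoadOnceSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the expensive loader on the first call only; later calls return the cached instance.
    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] loads = {0};
        // Stand-in for the real, expensive dictionary load.
        LoadOnceSketch<String> trie = new LoadOnceSketch<>(() -> {
            loads[0]++;
            return "compiled-trie";
        });
        trie.get();
        trie.get();
        System.out.println("loader invocations: " + loads[0]);
    }
}
```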

## When to use bundled languages

Bundled dictionaries are suitable when:

- you need quick results without preparing custom data
- you are prototyping or experimenting
- your language requirements match the provided datasets

## When to use custom dictionaries

You should prefer custom dictionaries when:

- domain-specific vocabulary is important
- accuracy requirements are high
- you need full control over stemming behavior

Typical examples:

- technical terminology
- product catalogs
- biomedical text
- legal or financial language

## Production recommendation

For production systems:

1. Load a bundled dictionary
2. Extend it with domain-specific terms (optional)
3. Compile it into a binary `.radixor.gz` file
4. Deploy the compiled artifact
5. Load it using `loadBinary(...)`

This avoids:

- runtime parsing overhead
- repeated compilation
- startup latency

## Example workflow

```java
// 1. Load bundled dictionary
FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
        StemmerPatchTrieLoader.Language.US_UK_PROFI,
        true,
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);

// 2. Modify (optional)
FrequencyTrie.Builder<String> builder =
        FrequencyTrieBuilders.copyOf(
                base,
                String[]::new,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );

builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

// 3. Compile
FrequencyTrie<String> compiled = builder.build();

// 4. Save
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
```

## Limitations

* bundled dictionaries are **general-purpose**
* they may not reflect:

  * domain-specific usage
  * rare or specialized vocabulary
  * organization-specific terminology

## Next steps

* [Quick start](quick-start.md)
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Programmatic usage](programmatic-usage.md)

## Summary

Radixor’s built-in language support provides:

* immediate usability
* reference datasets
* a starting point for customization

For production systems, they are best used as:

* a baseline
* a seed for further extension
* a source for compiled deployment artifacts
305
docs/cli-compilation.md
Normal file
@@ -0,0 +1,305 @@
# CLI Compilation

> ← Back to [README.md](../README.md)

Radixor provides a command-line tool for compiling dictionary files into compact, production-ready binary stemmer tables.

This is the recommended workflow for deployment environments, as it separates:

- dictionary preparation (offline)
- stemming execution (runtime)

## Overview

The `Compile` tool:

1. reads a line-oriented dictionary file
2. converts word–stem pairs into patch commands
3. builds a trie structure
4. applies subtree reduction
5. writes a compressed binary artifact

The output is a `.radixor.gz` file suitable for fast runtime loading.

## Basic usage

```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original \
  --overwrite
```

## Required arguments

### `--input`

Path to the source dictionary file.

* must be in the [dictionary format](dictionary-format.md)
* must be readable
* UTF-8 encoding is expected

```
--input ./data/stemmer.txt
```

### `--output`

Path to the output binary file.

* parent directories are created automatically
* output is written as **GZip-compressed binary**

```
--output ./build/english.radixor.gz
```

## Optional arguments

### `--reduction-mode`

Controls how aggressively the trie is reduced during compilation.

Available values:

* `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`
* `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`

Example:

```
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

#### Recommendation

Use:

```
MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
```

This provides:

* safe behavior
* deterministic ordering
* good compression

### `--store-original`

Stores the stem itself as a no-op mapping.

```
--store-original
```

Effect:

* ensures that canonical forms are always resolvable
* improves robustness in real-world inputs

Recommended for most use cases.

### `--overwrite`

Allows overwriting an existing output file.

```
--overwrite
```

Without this flag:

* compilation fails if the output file already exists
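
The documented output behavior (parent directories created, GZip-compressed output, refusal to replace an existing file without `--overwrite`) can be expressed with standard `java.nio` and `java.util.zip` calls. The sketch below is an assumption about how such semantics could be implemented, not the actual `Compile` source; the class name, method names, and the string payload are placeholders.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class OverwriteSketch {

    // Writes a GZip-compressed payload; refuses to replace an existing file unless overwrite is set.
    static void write(Path output, String payload, boolean overwrite) throws IOException {
        Path parent = output.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        OpenOption[] options = overwrite
                ? new OpenOption[] {StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE}
                : new OpenOption[] {StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE};
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(Files.newOutputStream(output, options)))) {
            out.writeUTF(payload);
        }
    }

    // Reads the payload back through the matching GZip and data streams.
    static String read(Path input) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(Files.newInputStream(input)))) {
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempDirectory("radixor-sketch").resolve("build/demo.bin.gz");
        write(output, "first", false);
        try {
            write(output, "second", false);
        } catch (FileAlreadyExistsException e) {
            System.out.println("refused to overwrite " + output.getFileName());
        }
        write(output, "second", true);
        System.out.println(read(output));
    }
}
```

`StandardOpenOption.CREATE_NEW` fails before any bytes are written, so an existing artifact stays intact when the flag is missing.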

## Reduction strategy explained

Reduction merges semantically equivalent subtrees to reduce memory and file size.

Trade-offs:

| Mode      | Compression | Behavioral fidelity |
| --------- | ----------- | ------------------- |
| Ranked    | Medium      | High                |
| Unordered | High        | Medium              |
| Dominant  | Highest     | Lower (heuristic)   |

### Ranked (recommended)

* preserves full `getAll()` ordering
* safest and most predictable

### Unordered

* ignores ordering differences
* higher compression, but less precise semantics

### Dominant

* focuses on the most frequent result
* useful when only `get()` is relevant
* may lose secondary candidates

## Output format

The compiled file:

* is a binary representation of the trie
* uses **GZip compression**
* is optimized for:

  * fast loading
  * minimal memory footprint

Typical properties:

* small file size
* fast deserialization
* no runtime preprocessing required

## Example workflow

### 1. Prepare dictionary

```
run running runs ran
connect connected connecting
```

### 2. Compile

```bash
java org.egothor.stemmer.Compile \
  --input ./data/stemmer.txt \
  --output ./build/english.radixor.gz \
  --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
  --store-original
```

### 3. Use in application

```java
FrequencyTrie<String> trie =
        StemmerPatchTrieLoader.loadBinary("english.radixor.gz");
```

## Error handling

The CLI reports:

* missing input file
* invalid arguments
* I/O failures
* parsing errors

Typical exit codes:

* `0` – success
* non-zero – failure

Error details are printed to standard error.

## Performance considerations

### Compilation

* typically CPU-bound
* depends on dictionary size and reduction mode

### Output size

* depends on:

  * dictionary completeness
  * reduction strategy
* can vary significantly between modes

### Runtime impact

* compiled tries are optimized for:

  * fast lookup
  * low allocation
  * predictable latency

## Best practices

### Use offline compilation

* compile dictionaries during build or deployment
* do not compile on application startup

### Version your artifacts

* treat `.radixor.gz` files as versioned assets
* store them alongside application releases

### Choose reduction mode deliberately

* use **ranked** for correctness
* use **dominant** only if you fully understand the trade-offs

### Keep dictionaries clean

* better input → better compiled output
* avoid noise and inconsistencies

## Integration tips

* store compiled files under `resources/` or a dedicated directory
* load them once and reuse the trie instance
* avoid repeated loading in frequently executed code paths (for example, per-request processing)

## Next steps

* [Dictionary format](dictionary-format.md)
* [Programmatic usage](programmatic-usage.md)
* [Quick start](quick-start.md)

## Summary

The `Compile` CLI is the bridge between:

* human-readable dictionary data
* optimized runtime stemmer tables

It enables a clean separation between:

* data preparation
* runtime execution

and is the preferred way to prepare Radixor for production use.
255
docs/dictionary-format.md
Normal file
@@ -0,0 +1,255 @@
# Dictionary Format

> ← Back to [README.md](../README.md)

Radixor uses a simple, line-oriented dictionary format to define mappings between **word forms** and their **canonical stems**.

This format is intentionally minimal, language-agnostic, and easy to generate from existing linguistic resources or corpora.

## Overview

Each logical line defines:

- one **canonical stem**
- zero or more **word variants** belonging to that stem

```
stem variant1 variant2 variant3 ...
```

At compile time:

- each variant is converted into a **patch command** transforming the variant into the stem
- the stem itself may optionally be stored as a **no-op mapping**

## Basic example

```
run running runs ran
connect connected connecting connection
analyze analyzing analysed analyses
```

This defines:

| Stem    | Variants                          |
|---------|-----------------------------------|
| run     | running, runs, ran                |
| connect | connected, connecting, connection |
| analyze | analyzing, analysed, analyses     |

## Syntax rules

### 1. Tokenization

- Tokens are separated by **whitespace**
- Multiple spaces and tabs are treated as a single separator
- Leading and trailing whitespace is ignored

### 2. First token is the stem

- The **first token** on each line is always the canonical stem
- All following tokens are treated as variants of that stem

### 3. Case normalization

- All input is normalized to **lowercase using `Locale.ROOT`**
- Dictionaries should ideally already be lowercase to avoid ambiguity

### 4. Empty lines

- Empty lines are ignored

### 5. Duplicate variants

- Duplicate variants are allowed but have no additional effect
- Frequency is determined by occurrence across the entire dataset

## Remarks (comments)

The parser supports both full-line and trailing remarks.

### Supported remark markers

- `#`
- `//`

### Examples

```
run running runs ran # English verb forms
connect connected connecting // basic forms
```

Everything after the first occurrence of a remark marker is ignored.

### Important note

Remark markers are not escaped. If `#` or `//` appear in a token, they will terminate the line.
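
The syntax and remark rules above can be combined into a small line parser. The sketch below is an independent illustration of those rules, not the code of `StemmerDictionaryParser`; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DictionaryLineSketch {

    // Returns the lowercase tokens of one line (stem first), or an empty list
    // for blank and remark-only lines.
    static List<String> parseLine(String line) {
        // Cut the line at the first remark marker, whichever comes first.
        int cut = line.length();
        int hash = line.indexOf('#');
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        int slashes = line.indexOf("//");
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        String content = line.substring(0, cut).trim().toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        if (!content.isEmpty()) {
            for (String token : content.split("\\s+")) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("run running runs ran # English verb forms"));
        System.out.println(parseLine("  Connect\tconnected   connecting // basic forms"));
        System.out.println(parseLine("# full-line remark"));
    }
}
```

The first token of a non-empty result is the stem and the remaining tokens are its variants. Because markers are not escaped, a token such as `c#` would be cut at the `#` in this sketch as well.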

## Storing the original form

When compiling, you may enable:

```
--store-original
```

This causes the stem itself to be stored using a **no-op patch command**.

Example:

```
run running runs
```

With `--store-original`, this implicitly includes:

```
run -> run
```

This is useful when:

- the input may already be normalized
- you want stable identity mappings
- you want to avoid missing entries for canonical forms

## Frequency and ordering

Radixor tracks **local frequencies** of values.

Frequency is determined by:

- how many times a mapping appears during construction
- merging behavior during reduction

When multiple stems exist for a word:

- results are ordered by **descending frequency**
- ties are resolved deterministically:
  1. shorter textual representation wins
  2. lexicographically smaller value wins
  3. earlier insertion order wins

This guarantees **stable and reproducible results**.
|
||||
|
||||
## Ambiguity and multiple stems
|
||||
|
||||
A word may legitimately map to more than one stem:
|
||||
|
||||
```
|
||||
axes ax axe
|
||||
```
|
||||
|
||||
This allows Radixor to represent ambiguity explicitly.
|
||||
|
||||
At runtime:
|
||||
|
||||
- `get(word)` returns the **preferred result**
|
||||
- `getAll(word)` returns **all candidates**
|
||||
|
||||
## Design guidelines
|
||||
|
||||
### Keep stems consistent
|
||||
|
||||
Use a single canonical form:
|
||||
|
||||
- `run` instead of mixing `run` / `running`
|
||||
- `analyze` vs `analyse` — pick one convention
|
||||
|
||||
### Avoid noise
|
||||
|
||||
Do not include:
|
||||
|
||||
- typos
|
||||
- extremely rare forms (unless required)
|
||||
- inconsistent normalization
|
||||
|
||||
### Prefer completeness over clever rules
|
||||
|
||||
Radixor is data-driven:
|
||||
|
||||
- more complete dictionaries → better results
|
||||
- no hidden rule system compensates for missing entries
|
||||
|
||||
### Handle domain-specific vocabulary
|
||||
|
||||
You can extend dictionaries with:
|
||||
|
||||
- product names
|
||||
- technical terms
|
||||
- organization-specific terminology
|
||||
|
||||
## Example: minimal dictionary
|
||||
|
||||
```
|
||||
go goes going went
|
||||
be is are was were being
|
||||
have has having had
|
||||
```
|
||||
|
||||
## Example: domain-specific extension
|
||||
|
||||
```
|
||||
microservice microservices
|
||||
container containers containerized
|
||||
kubernetes kubernetes
|
||||
```
|
||||
|
||||
## Common pitfalls
|
||||
|
||||
### Mixing cases
|
||||
|
||||
```
|
||||
Run running Runs ❌
|
||||
```
|
||||
|
||||
→ tokens are normalized to lowercase, but inconsistent casing in the source file is error-prone
|
||||
|
||||
### Multiple stems on one line
|
||||
|
||||
```
|
||||
run running connect ❌
|
||||
```
|
||||
|
||||
→ `connect` becomes a variant of `run`, which is incorrect
|
||||
|
||||
### Hidden comments
|
||||
|
||||
```
|
||||
run running //comment runs ❌
|
||||
```
|
||||
|
||||
→ everything after `//` is ignored
|
||||
|
||||
## When to use this format
|
||||
|
||||
This format is suitable for:
|
||||
|
||||
- curated linguistic datasets
|
||||
- exported morphological dictionaries
|
||||
- domain-specific vocabularies
|
||||
- generated `(word, stem)` pairs from corpora
|
||||
|
||||
## Next steps
|
||||
|
||||
- [CLI compilation](cli-compilation.md)
|
||||
- [Programmatic usage](programmatic-usage.md)
|
||||
- [Quick start](quick-start.md)
|
||||
|
||||
## Summary
|
||||
|
||||
Radixor dictionaries are intentionally simple:
|
||||
|
||||
- one line per stem
|
||||
- whitespace-separated tokens
|
||||
- optional remarks
|
||||
- no embedded rules
|
||||
|
||||
This simplicity enables:
|
||||
|
||||
- easy generation
|
||||
- fast parsing
|
||||
- deterministic behavior
|
||||
- efficient compilation into compact patch-command tries
|
||||
322
docs/programmatic-usage.md
Normal file
@@ -0,0 +1,322 @@
|
||||
# Programmatic Usage
|
||||
|
||||
> ← Back to [README.md](../README.md)
|
||||
|
||||
This document describes how to use **Radixor** programmatically from Java.
|
||||
|
||||
It covers:
|
||||
|
||||
- building a trie from dictionary data
|
||||
- compiling it into an immutable structure
|
||||
- loading compiled stemmers
|
||||
- querying for stems
|
||||
- working with multiple candidates
|
||||
- modifying existing compiled stemmers
|
||||
|
||||
|
||||
|
||||
## Overview
|
||||
|
||||
Radixor separates the stemming lifecycle into three stages:
|
||||
|
||||
1. **Build** – collect word–stem mappings in a mutable structure
|
||||
2. **Compile** – reduce and convert to an immutable trie
|
||||
3. **Query** – perform fast runtime lookups
|
||||
|
||||
These stages are represented by:
|
||||
|
||||
- `FrequencyTrie.Builder` (mutable)
|
||||
- `FrequencyTrie` (immutable, compiled)
|
||||
- `StemmerPatchTrieLoader` / `StemmerPatchTrieBinaryIO` (I/O)
|
||||
|
||||
|
||||
|
||||
## Building a trie programmatically
|
||||
|
||||
You can construct a trie directly without using the CLI.
|
||||
|
||||
```java
|
||||
import org.egothor.stemmer.*;
|
||||
|
||||
public final class BuildExample {
|
||||
|
||||
public static void main(String[] args) {
|
||||
ReductionSettings settings = ReductionSettings.withDefaults(
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
);
|
||||
|
||||
FrequencyTrie.Builder<String> builder =
|
||||
new FrequencyTrie.Builder<>(String[]::new, settings);
|
||||
|
||||
PatchCommandEncoder encoder = new PatchCommandEncoder();
|
||||
|
||||
builder.put("running", encoder.encode("running", "run"));
|
||||
builder.put("runs", encoder.encode("runs", "run"));
|
||||
builder.put("ran", encoder.encode("ran", "run"));
|
||||
|
||||
FrequencyTrie<String> trie = builder.build();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Loading from dictionary files
|
||||
|
||||
To parse dictionary files directly:
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.egothor.stemmer.*;
|
||||
|
||||
public final class LoadFromDictionaryExample {
|
||||
|
||||
public static void main(String[] args) throws IOException {
|
||||
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
Path.of("data/stemmer.txt"),
|
||||
true,
|
||||
ReductionSettings.withDefaults(
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
)
|
||||
);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Loading a compiled binary trie
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.egothor.stemmer.*;
|
||||
|
||||
public final class LoadBinaryExample {
|
||||
|
||||
public static void main(String[] args) throws IOException {
|
||||
FrequencyTrie<String> trie =
|
||||
StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This is the **preferred production approach**.
|
||||
|
||||
|
||||
|
||||
## Querying for stems
|
||||
|
||||
### Preferred result
|
||||
|
||||
```java
|
||||
String word = "running";
|
||||
String patch = trie.get(word);
|
||||
String stem = PatchCommandEncoder.apply(word, patch);
|
||||
```
|
||||
|
||||
### All candidates
|
||||
|
||||
```java
|
||||
String[] patches = trie.getAll(word);
|
||||
|
||||
for (String patch : patches) {
|
||||
String stem = PatchCommandEncoder.apply(word, patch);
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Accessing value frequencies
|
||||
|
||||
For diagnostic or advanced use cases:
|
||||
|
||||
```java
|
||||
import org.egothor.stemmer.ValueCount;
|
||||
|
||||
java.util.List<ValueCount<String>> entries = trie.getEntries("axes");
|
||||
|
||||
for (ValueCount<String> entry : entries) {
|
||||
String patch = entry.value();
|
||||
int count = entry.count();
|
||||
}
|
||||
```
|
||||
|
||||
This allows:
|
||||
|
||||
* inspecting ambiguity
|
||||
* understanding ranking decisions
|
||||
* debugging dictionary quality
|
||||
|
||||
|
||||
|
||||
## Using bundled language resources
|
||||
|
||||
```java
|
||||
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||||
true,
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
);
|
||||
```
|
||||
|
||||
Bundled dictionaries are useful for:
|
||||
|
||||
* quick integration
|
||||
* testing
|
||||
* reference behavior
|
||||
|
||||
|
||||
|
||||
## Persisting a compiled trie
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.egothor.stemmer.*;
|
||||
|
||||
public final class SaveExample {
|
||||
|
||||
public static void main(String[] args) throws IOException {
|
||||
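// Obtain the trie to persist; here it is compiled from a source dictionary.
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
Path.of("data/stemmer.txt"),
true,
ReductionSettings.withDefaults(
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));

// Write the compiled trie as a GZip-compressed binary artifact.
StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));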
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Modifying an existing trie
|
||||
|
||||
A compiled trie can be reopened into a builder, extended, and rebuilt.
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.egothor.stemmer.*;
|
||||
|
||||
public final class ModifyExample {
|
||||
|
||||
public static void main(String[] args) throws IOException {
|
||||
FrequencyTrie<String> compiled =
|
||||
StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));
|
||||
|
||||
ReductionSettings settings = ReductionSettings.withDefaults(
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
);
|
||||
|
||||
FrequencyTrie.Builder<String> builder =
|
||||
FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);
|
||||
|
||||
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
|
||||
|
||||
FrequencyTrie<String> updated = builder.build();
|
||||
|
||||
StemmerPatchTrieBinaryIO.write(updated,
|
||||
Path.of("english-custom.radixor.gz"));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Thread safety
|
||||
|
||||
* `FrequencyTrie` (compiled):
  * **thread-safe**
  * safe for concurrent reads
* `FrequencyTrie.Builder`:
  * **not thread-safe**
  * intended for single-threaded construction
|
||||
|
||||
|
||||
|
||||
## Performance characteristics
|
||||
|
||||
### Querying
|
||||
|
||||
* O(length of word)
|
||||
* minimal allocations
|
||||
* suitable for high-throughput pipelines
|
||||
|
||||
### Loading
|
||||
|
||||
* binary loading is fast
|
||||
* no preprocessing required
|
||||
|
||||
### Building
|
||||
|
||||
* depends on dictionary size
|
||||
* reduction phase may be CPU-intensive
|
||||
|
||||
|
||||
|
||||
## Best practices
|
||||
|
||||
### Reuse compiled trie instances
|
||||
|
||||
* load once
|
||||
* share across threads
|
||||
|
||||
### Prefer binary loading in production
|
||||
|
||||
* avoid rebuilding at runtime
|
||||
* treat compiled files as deployable artifacts
|
||||
|
||||
### Use `getAll()` only when needed
|
||||
|
||||
* `get()` is faster and sufficient for most use cases
|
||||
|
||||
### Keep builders short-lived
|
||||
|
||||
* build → compile → discard
|
||||
|
||||
|
||||
|
||||
## Integration patterns
|
||||
|
||||
### Search systems
|
||||
|
||||
* apply stemming during indexing and querying
|
||||
* ensure consistent dictionary usage
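The following sketch shows why the same stemmer must be applied on both sides. It is a toy inverted index, not part of Radixor: `ConsistentStemmingSketch` is a hypothetical class, and the `UnaryOperator<String>` stands in for a lookup followed by patch application.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

public final class ConsistentStemmingSketch {

    private final UnaryOperator<String> stemmer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** The same stemmer instance is used for indexing and for querying. */
    public ConsistentStemmingSketch(final UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Indexes a document under the stems of its whitespace-separated tokens. */
    public void index(final int documentId, final String text) {
        for (final String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(stemmer.apply(token), key -> new TreeSet<>())
                        .add(documentId);
            }
        }
    }

    /** Looks up a query word by its stem, so that inflected forms match. */
    public Set<Integer> search(final String queryWord) {
        final String stem = stemmer.apply(queryWord.toLowerCase(Locale.ROOT));
        return postings.getOrDefault(stem, Collections.emptySet());
    }
}
```

If the index were built with one dictionary and queried with another, the stems on the two sides could diverge and matching documents would silently be missed.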
|
||||
|
||||
### Text normalization pipelines
|
||||
|
||||
* integrate as a transformation step
|
||||
* combine with tokenization and filtering
|
||||
|
||||
### Domain adaptation
|
||||
|
||||
* extend dictionaries with domain-specific vocabulary
|
||||
* rebuild compiled artifacts
|
||||
|
||||
|
||||
|
||||
## Next steps
|
||||
|
||||
* [Dictionary format](dictionary-format.md)
|
||||
* [CLI compilation](cli-compilation.md)
|
||||
* [Architecture and reduction](architecture-and-reduction.md)
|
||||
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Programmatic usage of Radixor follows a clear pattern:
|
||||
|
||||
* build or load a trie
|
||||
* query using patch commands
|
||||
* apply transformations
|
||||
|
||||
The API is intentionally simple at the surface, while providing deeper control when needed for:
|
||||
|
||||
* ambiguity handling
|
||||
* diagnostics
|
||||
* dictionary evolution
|
||||
317
docs/quality-and-operations.md
Normal file
@@ -0,0 +1,317 @@
|
||||
# Quality and Operations
|
||||
|
||||
> ← Back to [README.md](../README.md)
|
||||
|
||||
This document describes quality, testing, and operational practices for **Radixor**.
|
||||
|
||||
It focuses on:
|
||||
|
||||
- reliability and determinism
|
||||
- testing strategies
|
||||
- deployment patterns
|
||||
- performance considerations
|
||||
- lifecycle management of stemmer data
|
||||
|
||||
|
||||
|
||||
## Overview
|
||||
|
||||
Radixor is designed to separate:
|
||||
|
||||
- **data preparation** (dictionary construction and compilation)
|
||||
- **runtime execution** (lookup and patch application)
|
||||
|
||||
This separation enables:
|
||||
|
||||
- predictable runtime behavior
|
||||
- reproducible builds
|
||||
- controlled evolution of stemming data
|
||||
|
||||
|
||||
|
||||
## Determinism and reproducibility
|
||||
|
||||
Radixor emphasizes deterministic behavior.
|
||||
|
||||
### Deterministic outputs
|
||||
|
||||
Given:
|
||||
|
||||
- the same dictionary input
|
||||
- the same reduction settings
|
||||
|
||||
Radixor guarantees:
|
||||
|
||||
- identical compiled trie structure
|
||||
- identical value ordering
|
||||
- identical lookup results
|
||||
|
||||
### Why this matters
|
||||
|
||||
- stable search behavior across deployments
|
||||
- reproducible builds
|
||||
- easier debugging and regression analysis
|
||||
|
||||
|
||||
|
||||
## Testing strategy
|
||||
|
||||
### Unit testing
|
||||
|
||||
Core components should be tested independently:
|
||||
|
||||
- patch encoding and decoding
|
||||
- trie construction
|
||||
- reduction behavior
|
||||
- binary serialization and deserialization
|
||||
|
||||
### Dictionary validation tests
|
||||
|
||||
A recommended pattern:
|
||||
|
||||
1. load dictionary input
|
||||
2. compile trie
|
||||
3. re-apply all word → stem mappings
|
||||
4. verify that:
   - the expected stem is present in `getAll()`
   - the preferred result (`get()`) matches the expected stem when the word has a single unambiguous stem
|
||||
|
||||
This ensures:
|
||||
|
||||
- no data loss during reduction
|
||||
- correctness of patch encoding
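The validation pattern above can be sketched independently of the library. In the sketch below, `DictionaryValidationSketch` and `findMissing` are hypothetical names, and the `candidateStems` function is a stand-in for calling `getAll(word)` and applying each returned patch to the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public final class DictionaryValidationSketch {

    private DictionaryValidationSketch() {
        throw new AssertionError("No instances.");
    }

    /**
     * Checks every expected {word, stem} pair against the candidate stems
     * produced for that word and returns the pairs that were not found.
     */
    public static List<String> findMissing(
            final List<String[]> expectedPairs,
            final Function<String, List<String>> candidateStems) {
        final List<String> missing = new ArrayList<>();
        for (final String[] pair : expectedPairs) {
            final String word = pair[0];
            final String stem = pair[1];
            final List<String> candidates = candidateStems.apply(word);
            if (candidates == null || !candidates.contains(stem)) {
                missing.add(word + " -> " + stem);
            }
        }
        return missing;
    }
}
```

An empty result means every source mapping survived compilation; a non-empty result lists the mappings to investigate.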
|
||||
|
||||
|
||||
|
||||
## Regression testing
|
||||
|
||||
Maintain a stable test dataset:
|
||||
|
||||
- representative vocabulary
|
||||
- edge cases (short words, long words, ambiguous forms)
|
||||
|
||||
Use it to:
|
||||
|
||||
- detect unintended changes
|
||||
- verify behavior after refactoring
|
||||
- validate reduction mode changes
|
||||
|
||||
|
||||
|
||||
## Performance testing
|
||||
|
||||
Performance should be evaluated in terms of:
|
||||
|
||||
### Throughput
|
||||
|
||||
- words processed per second
|
||||
|
||||
### Latency
|
||||
|
||||
- time per lookup
|
||||
|
||||
### Memory footprint
|
||||
|
||||
- size of compiled trie
|
||||
- runtime memory usage
|
||||
|
||||
Benchmark with:
|
||||
|
||||
- realistic token streams
|
||||
- production-like dictionaries
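A first throughput estimate can be obtained with a simple timing loop such as the sketch below. `ThroughputSketch` is a hypothetical helper, and the `UnaryOperator<String>` stands in for the real lookup. Numbers from a loop like this are indicative only; a dedicated harness such as JMH handles warm-up and JIT effects more reliably.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public final class ThroughputSketch {

    /** Written after the loop so the JIT cannot discard the lookups as dead code. */
    private static volatile long sink;

    private ThroughputSketch() {
        throw new AssertionError("No instances.");
    }

    /** Processes the token list the given number of rounds and returns words per second. */
    public static double wordsPerSecond(
            final List<String> tokens,
            final UnaryOperator<String> stemmer,
            final int rounds) {
        long checksum = 0;
        final long start = System.nanoTime();
        for (int round = 0; round < rounds; round++) {
            for (final String token : tokens) {
                final String stem = stemmer.apply(token);
                checksum += stem == null ? 0 : stem.length();
            }
        }
        // Guard against a zero interval on very small inputs.
        final long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        sink = checksum;
        return (double) tokens.size() * rounds * 1_000_000_000.0 / elapsedNanos;
    }
}
```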
|
||||
|
||||
|
||||
|
||||
## Deployment model
|
||||
|
||||
### Recommended workflow
|
||||
|
||||
1. prepare dictionary data
|
||||
2. compile using CLI
|
||||
3. store `.radixor.gz` artifact
|
||||
4. deploy artifact with application
|
||||
5. load using `loadBinary(...)`
|
||||
|
||||
### Why this model
|
||||
|
||||
- avoids runtime compilation overhead
|
||||
- reduces startup latency
|
||||
- ensures consistent behavior across environments
|
||||
|
||||
|
||||
|
||||
## Artifact management
|
||||
|
||||
Compiled stemmers should be treated as versioned assets.
|
||||
|
||||
### Versioning
|
||||
|
||||
- include version in filename or metadata
|
||||
- track dictionary source and reduction settings
|
||||
|
||||
Example:
|
||||
|
||||
```
|
||||
english-v1.2-ranked.radixor.gz
|
||||
```
|
||||
|
||||
### Storage
|
||||
|
||||
- store in repository or artifact storage
|
||||
- ensure consistent distribution across environments
|
||||
|
||||
|
||||
|
||||
## Runtime usage
|
||||
|
||||
### Loading
|
||||
|
||||
- load once during application startup
|
||||
- reuse `FrequencyTrie` instance
|
||||
|
||||
### Thread safety
|
||||
|
||||
- compiled trie is safe for concurrent access
|
||||
- no synchronization required for reads
|
||||
|
||||
### Avoid repeated loading
|
||||
|
||||
Do not:
|
||||
|
||||
- load trie per request
|
||||
- rebuild trie at runtime
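One way to enforce load-once semantics is a small holder that runs the loader at most once and then hands out the same instance. The sketch below is an assumption-labeled illustration: `LoadOnceSketch` is a hypothetical class, and the `Supplier` would wrap whichever loading call the application uses.

```java
import java.util.Objects;
import java.util.function.Supplier;

public final class LoadOnceSketch<T> {

    private final Supplier<T> loader;
    private volatile T instance;

    public LoadOnceSketch(final Supplier<T> loader) {
        this.loader = Objects.requireNonNull(loader, "loader");
    }

    /** Returns the shared instance, invoking the loader only on the first call. */
    public T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = Objects.requireNonNull(loader.get(), "loader returned null");
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Applications that use a dependency-injection container can get the same effect by registering the compiled trie as a singleton.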
|
||||
|
||||
|
||||
|
||||
## Memory considerations
|
||||
|
||||
- compiled tries are compact, but their memory footprint is not negligible
- size depends on:
  - dictionary size
  - reduction mode
|
||||
|
||||
Recommendations:
|
||||
|
||||
- monitor memory usage in production
|
||||
- choose reduction mode appropriately
|
||||
|
||||
|
||||
|
||||
## Reduction mode in production
|
||||
|
||||
Default recommendation:
|
||||
|
||||
- use **ranked mode**
|
||||
|
||||
Switch to other modes only when:
|
||||
|
||||
- memory constraints are strict
|
||||
- multiple candidate results are not required
|
||||
|
||||
Always validate behavior after changing reduction mode.
|
||||
|
||||
|
||||
|
||||
## Dictionary lifecycle
|
||||
|
||||
### Updating dictionaries
|
||||
|
||||
When dictionary data changes:
|
||||
|
||||
1. update source file
|
||||
2. recompile
|
||||
3. run validation tests
|
||||
4. deploy new artifact
|
||||
|
||||
### Backward compatibility
|
||||
|
||||
- changes in dictionary may affect stemming results
|
||||
- evaluate impact on search relevance
|
||||
|
||||
|
||||
|
||||
## Observability
|
||||
|
||||
Radixor itself does not provide observability features; the integrating application should provide:
|
||||
|
||||
- logging for loading failures
|
||||
- metrics for lookup throughput
|
||||
- monitoring of memory usage
|
||||
|
||||
Optional:
|
||||
|
||||
- sampling of ambiguous results (`getAll()`)
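A thin wrapper is usually enough to expose lookup metrics. In the sketch below, `CountingLookupSketch` is a hypothetical class, the wrapped `Function` stands in for the real lookup, and treating a `null` result as a miss is an assumption about how missing words are reported.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

public final class CountingLookupSketch {

    private final Function<String, String> lookup;
    private final LongAdder lookups = new LongAdder();
    private final LongAdder misses = new LongAdder();

    public CountingLookupSketch(final Function<String, String> lookup) {
        this.lookup = lookup;
    }

    /** Delegates to the wrapped lookup and records the call and any miss. */
    public String apply(final String word) {
        lookups.increment();
        final String result = lookup.apply(word);
        if (result == null) {
            misses.increment(); // assumption: null means "no entry"
        }
        return result;
    }

    public long lookupCount() {
        return lookups.sum();
    }

    public long missCount() {
        return misses.sum();
    }
}
```

The two counters can be exported to whatever metrics system the application already uses.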
|
||||
|
||||
|
||||
|
||||
## Error handling
|
||||
|
||||
### During compilation
|
||||
|
||||
Handle:
|
||||
|
||||
- invalid dictionary format
|
||||
- I/O failures
|
||||
- invalid arguments
|
||||
|
||||
### During runtime
|
||||
|
||||
Handle:
|
||||
|
||||
- missing dictionary files
|
||||
- corrupted binary artifacts
|
||||
|
||||
Fail fast on initialization errors.
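A fail-fast loader can be sketched as follows. `FailFastSketch` and its `BinaryLoader` interface are hypothetical; in an application, the loader argument would be whichever call reads the compiled artifact, for example a method reference to `StemmerPatchTrieLoader.loadBinary`, assuming its signature matches.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public final class FailFastSketch {

    /** Hypothetical abstraction over the call that reads a compiled artifact. */
    @FunctionalInterface
    public interface BinaryLoader<T> {
        T load(Path path) throws IOException;
    }

    private FailFastSketch() {
        throw new AssertionError("No instances.");
    }

    /** Loads the artifact during startup or aborts with a descriptive exception. */
    public static <T> T loadOrFail(final Path artifact, final BinaryLoader<T> loader) {
        if (!Files.isRegularFile(artifact)) {
            throw new IllegalStateException("Stemmer artifact not found: " + artifact);
        }
        try {
            return loader.load(artifact);
        } catch (IOException e) {
            // Covers unreadable or corrupted files reported by the loader.
            throw new UncheckedIOException("Cannot load stemmer artifact: " + artifact, e);
        }
    }
}
```

Calling this from application startup, rather than lazily on the first request, surfaces a missing or corrupted artifact before any traffic is served.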
|
||||
|
||||
|
||||
|
||||
## Operational best practices
|
||||
|
||||
- compile dictionaries offline
|
||||
- version compiled artifacts
|
||||
- test before deployment
|
||||
- load once and reuse
|
||||
- monitor performance and memory
|
||||
- document reduction settings used
|
||||
|
||||
|
||||
|
||||
## Security considerations
|
||||
|
||||
- dictionary input is treated as trusted data, so compile only sources you control
|
||||
- validate external sources before compilation
|
||||
- avoid loading unverified binary artifacts
|
||||
|
||||
|
||||
|
||||
## Integration checklist
|
||||
|
||||
Before production deployment:
|
||||
|
||||
- dictionary validated
|
||||
- compiled artifact generated
|
||||
- reduction mode documented
|
||||
- performance tested
|
||||
- memory usage verified
|
||||
- regression tests passing
|
||||
|
||||
|
||||
|
||||
## Next steps
|
||||
|
||||
- [Quick start](quick-start.md)
|
||||
- [CLI compilation](cli-compilation.md)
|
||||
- [Programmatic usage](programmatic-usage.md)
|
||||
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Radixor is designed for:
|
||||
|
||||
- deterministic behavior
|
||||
- efficient runtime execution
|
||||
- controlled data-driven evolution
|
||||
|
||||
By separating compilation from runtime and following proper operational practices, it can be reliably integrated into production-grade systems.
|
||||
148
docs/quick-start.md
Normal file
@@ -0,0 +1,148 @@
|
||||
# Quick Start
|
||||
|
||||
> ← Back to [README.md](../README.md)
|
||||
|
||||
This guide shows the fastest way to start using **Radixor** and the most common next steps.
|
||||
|
||||
## Hello world
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
|
||||
import org.egothor.stemmer.FrequencyTrie;
|
||||
import org.egothor.stemmer.PatchCommandEncoder;
|
||||
import org.egothor.stemmer.ReductionMode;
|
||||
import org.egothor.stemmer.StemmerPatchTrieLoader;
|
||||
|
||||
public final class HelloRadixor {
|
||||
|
||||
private HelloRadixor() {
|
||||
throw new AssertionError("No instances.");
|
||||
}
|
||||
|
||||
public static void main(final String[] arguments) throws IOException {
|
||||
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||||
true,
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
|
||||
|
||||
final String word = "running";
|
||||
final String patch = trie.get(word);
|
||||
final String stem = PatchCommandEncoder.apply(word, patch);
|
||||
|
||||
System.out.println(word + " -> " + stem);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
This example shows the core workflow:
|
||||
|
||||
1. load a trie
|
||||
2. get a patch command for a word
|
||||
3. apply the patch
|
||||
4. obtain the stem
|
||||
|
||||
## Retrieve multiple candidate stems
|
||||
|
||||
If you need more than one candidate result, use `getAll(...)` instead of `get(...)`.
|
||||
|
||||
```java
|
||||
final String word = "axes";
|
||||
final String[] patches = trie.getAll(word);
|
||||
|
||||
for (String patch : patches) {
|
||||
final String stem = PatchCommandEncoder.apply(word, patch);
|
||||
System.out.println(word + " -> " + stem + " (" + patch + ")");
|
||||
}
|
||||
```
|
||||
|
||||
## Load a compiled binary stemmer
|
||||
|
||||
For production systems, the preferred approach is usually to precompile the dictionary and load the compressed binary artifact at runtime.
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.egothor.stemmer.FrequencyTrie;
|
||||
import org.egothor.stemmer.PatchCommandEncoder;
|
||||
import org.egothor.stemmer.StemmerPatchTrieLoader;
|
||||
|
||||
public final class BinaryStemmerExample {
|
||||
|
||||
private BinaryStemmerExample() {
|
||||
throw new AssertionError("No instances.");
|
||||
}
|
||||
|
||||
public static void main(final String[] arguments) throws IOException {
|
||||
final Path path = Path.of("stemmers", "english.radixor.gz");
|
||||
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.loadBinary(path);
|
||||
|
||||
final String word = "connected";
|
||||
final String patch = trie.get(word);
|
||||
final String stem = PatchCommandEncoder.apply(word, patch);
|
||||
|
||||
System.out.println(word + " -> " + stem);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Compile a dictionary from the command line
|
||||
|
||||
```bash
|
||||
java org.egothor.stemmer.Compile \
|
||||
--input ./data/stemmer.txt \
|
||||
--output ./build/english.radixor.gz \
|
||||
--reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
|
||||
--store-original \
|
||||
--overwrite
|
||||
```
|
||||
|
||||
## Modify an existing compiled stemmer
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
import org.egothor.stemmer.FrequencyTrie;
|
||||
import org.egothor.stemmer.FrequencyTrieBuilders;
|
||||
import org.egothor.stemmer.PatchCommandEncoder;
|
||||
import org.egothor.stemmer.ReductionMode;
|
||||
import org.egothor.stemmer.ReductionSettings;
|
||||
import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
|
||||
|
||||
public final class ModifyCompiledExample {
|
||||
|
||||
private ModifyCompiledExample() {
|
||||
throw new AssertionError("No instances.");
|
||||
}
|
||||
|
||||
public static void main(final String[] arguments) throws IOException {
|
||||
final Path input = Path.of("stemmers", "english.radixor.gz");
|
||||
final Path output = Path.of("stemmers", "english-custom.radixor.gz");
|
||||
|
||||
final FrequencyTrie<String> compiledTrie = StemmerPatchTrieBinaryIO.read(input);
|
||||
|
||||
final ReductionSettings settings = ReductionSettings.withDefaults(
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
|
||||
|
||||
final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
|
||||
compiledTrie,
|
||||
String[]::new,
|
||||
settings);
|
||||
|
||||
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
|
||||
|
||||
final FrequencyTrie<String> updatedTrie = builder.build();
|
||||
StemmerPatchTrieBinaryIO.write(updatedTrie, output);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Where to continue
|
||||
|
||||
* [Dictionary format](dictionary-format.md)
|
||||
* [CLI compilation](cli-compilation.md)
|
||||
* [Programmatic usage](programmatic-usage.md)
|
||||
* [Built-in languages](built-in-languages.md)
|
||||
* [Architecture and reduction](architecture-and-reduction.md)
|
||||