feat: implement Compile CLI for building binary stemmer tables from source dictionaries feat: add loading support for persisted compiled tries, including GZip-compressed binaries feat: add a builder path for recreating a writable trie from a compiled trie feat: expose read-only value/count access for compiled trie entries feat: support deterministic NOOP patch encoding for identical source and target words fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers fix: preserve valid alternative reductions during trie optimization and reduction fix: correct patch command edge cases discovered in round-trip and malformed-input tests fix: address persistence and compiled-trie handling defects found during implementation review fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs refactor: reorganize trie-related support types into dedicated packages and classes refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture refactor: improve compiled/read-only trie boundaries without restoring mutability refactor: clean up internal reduction, serialization, and helper structure test: add professional JUnit coverage for stemmer core classes test: split trie tests into dedicated test classes per production type test: improve parameterized tests for readability, diagnostics, and edge-case traceability test: cover positive, negative, malformed, persistence, and round-trip scenarios test: verify compiled dictionaries against source inputs using getAll semantics docs: write public README and supplementary Markdown documentation for project publishing docs: document architecture, reduction model, built-in languages, and operational guidance docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation docs: improve examples and wording for professional reader-facing project guidance chore: align project materials with the practical Radix scope and Egothor/Stempel lineage chore: raise overall project quality through documentation review and test hardening
6.5 KiB
6.5 KiB
Programmatic Usage
← Back to README.md
This document describes how to use Radixor programmatically from Java.
It covers:
- building a trie from dictionary data
- compiling it into an immutable structure
- loading compiled stemmers
- querying for stems
- working with multiple candidates
- modifying existing compiled stemmers
Overview
Radixor separates the stemming lifecycle into three stages:
- Build – collect word–stem mappings in a mutable structure
- Compile – reduce and convert to an immutable trie
- Query – perform fast runtime lookups
These stages are represented by:
FrequencyTrie.Builder(mutable)FrequencyTrie(immutable, compiled)StemmerPatchTrieLoader/StemmerPatchTrieBinaryIO(I/O)
Building a trie programmatically
You can construct a trie directly without using the CLI.
import org.egothor.stemmer.*;
public final class BuildExample {
public static void main(String[] args) {
ReductionSettings settings = ReductionSettings.withDefaults(
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
FrequencyTrie.Builder<String> builder =
new FrequencyTrie.Builder<>(String[]::new, settings);
PatchCommandEncoder encoder = new PatchCommandEncoder();
builder.put("running", encoder.encode("running", "run"));
builder.put("runs", encoder.encode("runs", "run"));
builder.put("ran", encoder.encode("ran", "run"));
FrequencyTrie<String> trie = builder.build();
}
}
Loading from dictionary files
To parse dictionary files directly:
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;
public final class LoadFromDictionaryExample {
public static void main(String[] args) throws IOException {
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
Path.of("data/stemmer.txt"),
true,
ReductionSettings.withDefaults(
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
)
);
}
}
Loading a compiled binary trie
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;
public final class LoadBinaryExample {
public static void main(String[] args) throws IOException {
FrequencyTrie<String> trie =
StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
}
}
This is the preferred production approach.
Querying for stems
Preferred result
String word = "running";
String patch = trie.get(word);
String stem = PatchCommandEncoder.apply(word, patch);
All candidates
String[] patches = trie.getAll(word);
for (String patch : patches) {
String stem = PatchCommandEncoder.apply(word, patch);
}
Accessing value frequencies
For diagnostic or advanced use cases:
import org.egothor.stemmer.ValueCount;
java.util.List<ValueCount<String>> entries = trie.getEntries("axes");
for (ValueCount<String> entry : entries) {
String patch = entry.value();
int count = entry.count();
}
This allows:
- inspecting ambiguity
- understanding ranking decisions
- debugging dictionary quality
Using bundled language resources
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
Bundled dictionaries are useful for:
- quick integration
- testing
- reference behavior
Persisting a compiled trie
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;
public final class SaveExample {
public static void main(String[] args) throws IOException {
StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
}
}
Modifying an existing trie
A compiled trie can be reopened into a builder, extended, and rebuilt.
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;
public final class ModifyExample {
public static void main(String[] args) throws IOException {
FrequencyTrie<String> compiled =
StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));
ReductionSettings settings = ReductionSettings.withDefaults(
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
FrequencyTrie.Builder<String> builder =
FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
FrequencyTrie<String> updated = builder.build();
StemmerPatchTrieBinaryIO.write(updated,
Path.of("english-custom.radixor.gz"));
}
}
Thread safety
-
FrequencyTrie(compiled):- thread-safe
- safe for concurrent reads
-
FrequencyTrie.Builder:- not thread-safe
- intended for single-threaded construction
Performance characteristics
Querying
- O(length of word)
- minimal allocations
- suitable for high-throughput pipelines
Loading
- binary loading is fast
- no preprocessing required
Building
- depends on dictionary size
- reduction phase may be CPU-intensive
Best practices
Reuse compiled trie instances
- load once
- share across threads
Prefer binary loading in production
- avoid rebuilding at runtime
- treat compiled files as deployable artifacts
Use getAll() only when needed
get()is faster and sufficient for most use cases
Keep builders short-lived
- build → compile → discard
Integration patterns
Search systems
- apply stemming during indexing and querying
- ensure consistent dictionary usage
Text normalization pipelines
- integrate as a transformation step
- combine with tokenization and filtering
Domain adaptation
- extend dictionaries with domain-specific vocabulary
- rebuild compiled artifacts
Next steps
Summary
Programmatic usage of Radixor follows a clear pattern:
- build or load a trie
- query using patch commands
- apply transformations
The API is intentionally simple at the surface, while providing deeper control when needed for:
- ambiguity handling
- diagnostics
- dictionary evolution