feat: implement Compile CLI for building binary stemmer tables from source dictionaries feat: add loading support for persisted compiled tries, including GZip-compressed binaries feat: add a builder path for recreating a writable trie from a compiled trie feat: expose read-only value/count access for compiled trie entries feat: support deterministic NOOP patch encoding for identical source and target words fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers fix: preserve valid alternative reductions during trie optimization and reduction fix: correct patch command edge cases discovered in round-trip and malformed-input tests fix: address persistence and compiled-trie handling defects found during implementation review fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs refactor: reorganize trie-related support types into dedicated packages and classes refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture refactor: improve compiled/read-only trie boundaries without restoring mutability refactor: clean up internal reduction, serialization, and helper structure test: add professional JUnit coverage for stemmer core classes test: split trie tests into dedicated test classes per production type test: improve parameterized tests for readability, diagnostics, and edge-case traceability test: cover positive, negative, malformed, persistence, and round-trip scenarios test: verify compiled dictionaries against source inputs using getAll semantics docs: write public README and supplementary Markdown documentation for project publishing docs: document architecture, reduction model, built-in languages, and operational guidance docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation docs: improve examples and wording for professional reader-facing project guidance chore: align project materials with the practical Radix scope and Egothor/Stempel lineage chore: raise overall project quality through documentation review and test hardening
323 lines
6.5 KiB
Markdown
323 lines
6.5 KiB
Markdown
# Programmatic Usage
|
||
|
||
> ← Back to [README.md](../README.md)
|
||
|
||
This document describes how to use **Radixor** programmatically from Java.
|
||
|
||
It covers:
|
||
|
||
- building a trie from dictionary data
|
||
- compiling it into an immutable structure
|
||
- loading compiled stemmers
|
||
- querying for stems
|
||
- working with multiple candidates
|
||
- modifying existing compiled stemmers
|
||
|
||
|
||
|
||
## Overview
|
||
|
||
Radixor separates the stemming lifecycle into three stages:
|
||
|
||
1. **Build** – collect word–stem mappings in a mutable structure
|
||
2. **Compile** – reduce and convert to an immutable trie
|
||
3. **Query** – perform fast runtime lookups
|
||
|
||
These stages are represented by:
|
||
|
||
- `FrequencyTrie.Builder` (mutable)
|
||
- `FrequencyTrie` (immutable, compiled)
|
||
- `StemmerPatchTrieLoader` / `StemmerPatchTrieBinaryIO` (I/O)
|
||
|
||
|
||
|
||
## Building a trie programmatically
|
||
|
||
You can construct a trie directly without using the CLI.
|
||
|
||
```java
|
||
import org.egothor.stemmer.*;
|
||
|
||
public final class BuildExample {
|
||
|
||
public static void main(String[] args) {
|
||
ReductionSettings settings = ReductionSettings.withDefaults(
|
||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||
);
|
||
|
||
FrequencyTrie.Builder<String> builder =
|
||
new FrequencyTrie.Builder<>(String[]::new, settings);
|
||
|
||
PatchCommandEncoder encoder = new PatchCommandEncoder();
|
||
|
||
builder.put("running", encoder.encode("running", "run"));
|
||
builder.put("runs", encoder.encode("runs", "run"));
|
||
builder.put("ran", encoder.encode("ran", "run"));
|
||
|
||
FrequencyTrie<String> trie = builder.build();
|
||
}
|
||
}
|
||
```
|
||
|
||
|
||
|
||
## Loading from dictionary files
|
||
|
||
To parse dictionary files directly:
|
||
|
||
```java
|
||
import java.io.IOException;
|
||
import java.nio.file.Path;
|
||
|
||
import org.egothor.stemmer.*;
|
||
|
||
public final class LoadFromDictionaryExample {
|
||
|
||
public static void main(String[] args) throws IOException {
|
||
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||
Path.of("data/stemmer.txt"),
|
||
true,
|
||
ReductionSettings.withDefaults(
|
||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||
)
|
||
);
|
||
}
|
||
}
|
||
```
|
||
|
||
|
||
|
||
## Loading a compiled binary trie
|
||
|
||
```java
|
||
import java.io.IOException;
|
||
import java.nio.file.Path;
|
||
|
||
import org.egothor.stemmer.*;
|
||
|
||
public final class LoadBinaryExample {
|
||
|
||
public static void main(String[] args) throws IOException {
|
||
FrequencyTrie<String> trie =
|
||
StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
|
||
}
|
||
}
|
||
```
|
||
|
||
This is the **preferred production approach**.
|
||
|
||
|
||
|
||
## Querying for stems
|
||
|
||
### Preferred result
|
||
|
||
```java
|
||
String word = "running";
|
||
String patch = trie.get(word);
|
||
String stem = PatchCommandEncoder.apply(word, patch);
|
||
```
|
||
|
||
### All candidates
|
||
|
||
```java
|
||
String[] patches = trie.getAll(word);
|
||
|
||
for (String patch : patches) {
|
||
String stem = PatchCommandEncoder.apply(word, patch);
|
||
}
|
||
```
|
||
|
||
|
||
|
||
## Accessing value frequencies
|
||
|
||
For diagnostic or advanced use cases:
|
||
|
||
```java
|
||
import org.egothor.stemmer.ValueCount;
|
||
|
||
java.util.List<ValueCount<String>> entries = trie.getEntries("axes");
|
||
|
||
for (ValueCount<String> entry : entries) {
|
||
String patch = entry.value();
|
||
int count = entry.count();
|
||
}
|
||
```
|
||
|
||
This allows:
|
||
|
||
* inspecting ambiguity
|
||
* understanding ranking decisions
|
||
* debugging dictionary quality
|
||
|
||
|
||
|
||
## Using bundled language resources
|
||
|
||
```java
|
||
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||
true,
|
||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||
);
|
||
```
|
||
|
||
Bundled dictionaries are useful for:
|
||
|
||
* quick integration
|
||
* testing
|
||
* reference behavior
|
||
|
||
|
||
|
||
## Persisting a compiled trie
|
||
|
||
```java
|
||
import java.io.IOException;
|
||
import java.nio.file.Path;
|
||
|
||
import org.egothor.stemmer.*;
|
||
|
||
public final class SaveExample {
|
||
|
||
public static void main(String[] args) throws IOException {
|
||
StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
|
||
}
|
||
}
|
||
```
|
||
|
||
|
||
|
||
## Modifying an existing trie
|
||
|
||
A compiled trie can be reopened into a builder, extended, and rebuilt.
|
||
|
||
```java
|
||
import java.io.IOException;
|
||
import java.nio.file.Path;
|
||
|
||
import org.egothor.stemmer.*;
|
||
|
||
public final class ModifyExample {
|
||
|
||
public static void main(String[] args) throws IOException {
|
||
FrequencyTrie<String> compiled =
|
||
StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));
|
||
|
||
ReductionSettings settings = ReductionSettings.withDefaults(
|
||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||
);
|
||
|
||
FrequencyTrie.Builder<String> builder =
|
||
FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);
|
||
|
||
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
|
||
|
||
FrequencyTrie<String> updated = builder.build();
|
||
|
||
StemmerPatchTrieBinaryIO.write(updated,
|
||
Path.of("english-custom.radixor.gz"));
|
||
}
|
||
}
|
||
```
|
||
|
||
|
||
|
||
## Thread safety
|
||
|
||
* `FrequencyTrie` (compiled):
|
||
|
||
* **thread-safe**
|
||
* safe for concurrent reads
|
||
|
||
* `FrequencyTrie.Builder`:
|
||
|
||
* **not thread-safe**
|
||
* intended for single-threaded construction
|
||
|
||
|
||
|
||
## Performance characteristics
|
||
|
||
### Querying
|
||
|
||
* O(length of word)
|
||
* minimal allocations
|
||
* suitable for high-throughput pipelines
|
||
|
||
### Loading
|
||
|
||
* binary loading is fast
|
||
* no preprocessing required
|
||
|
||
### Building
|
||
|
||
* depends on dictionary size
|
||
* reduction phase may be CPU-intensive
|
||
|
||
|
||
|
||
## Best practices
|
||
|
||
### Reuse compiled trie instances
|
||
|
||
* load once
|
||
* share across threads
|
||
|
||
### Prefer binary loading in production
|
||
|
||
* avoid rebuilding at runtime
|
||
* treat compiled files as deployable artifacts
|
||
|
||
### Use `getAll()` only when needed
|
||
|
||
* `get()` is faster and sufficient for most use cases
|
||
|
||
### Keep builders short-lived
|
||
|
||
* build → compile → discard
|
||
|
||
|
||
|
||
## Integration patterns
|
||
|
||
### Search systems
|
||
|
||
* apply stemming during indexing and querying
|
||
* ensure consistent dictionary usage
|
||
|
||
### Text normalization pipelines
|
||
|
||
* integrate as a transformation step
|
||
* combine with tokenization and filtering
|
||
|
||
### Domain adaptation
|
||
|
||
* extend dictionaries with domain-specific vocabulary
|
||
* rebuild compiled artifacts
|
||
|
||
|
||
|
||
## Next steps
|
||
|
||
* [Dictionary format](dictionary-format.md)
|
||
* [CLI compilation](cli-compilation.md)
|
||
* [Architecture and reduction](architecture-and-reduction.md)
|
||
|
||
|
||
|
||
## Summary
|
||
|
||
Programmatic usage of Radixor follows a clear pattern:
|
||
|
||
* build or load a trie
|
||
* query using patch commands
|
||
* apply transformations
|
||
|
||
The API is intentionally simple at the surface, while providing deeper control when needed for:
|
||
|
||
* ambiguity handling
|
||
* diagnostics
|
||
* dictionary evolution
|