Radixor/docs/programmatic-usage.md
Leo Galambos 038514bad0 Refine stemmer core, compiled trie workflow, tests, and public documentation
feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words

fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs

refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure

test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics

docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance

chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00

# Programmatic Usage
> ← Back to [README.md](../README.md)
This document describes how to use **Radixor** programmatically from Java.
It covers:
- building a trie from dictionary data
- compiling it into an immutable structure
- loading compiled stemmers
- querying for stems
- working with multiple candidates
- modifying existing compiled stemmers
## Overview
Radixor separates the stemming lifecycle into three stages:
1. **Build**: collect word-to-stem mappings in a mutable structure
2. **Compile**: reduce the collected data and convert it to an immutable trie
3. **Query**: perform fast runtime lookups
These stages are represented by:
- `FrequencyTrie.Builder` (mutable)
- `FrequencyTrie` (immutable, compiled)
- `StemmerPatchTrieLoader` / `StemmerPatchTrieBinaryIO` (I/O)
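
The split between the three stages can be illustrated with a small self-contained sketch. It does not use Radixor's classes: `ToyStemTable` and its nested `Builder` are hypothetical stand-ins that keep plain stems in a `HashMap` instead of patch commands in a trie. The only thing the sketch is meant to mirror is the mutable-builder / immutable-result boundary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a compiled, read-only lookup structure (not a Radixor type). */
final class ToyStemTable {
    private final Map<String, String> stems;

    private ToyStemTable(Map<String, String> stems) {
        // Map.copyOf takes an unmodifiable snapshot, so later builder changes cannot leak in.
        this.stems = Map.copyOf(stems);
    }

    /** Stage 3 (query): returns the stored stem, or null when the word is unknown. */
    String get(String word) {
        return stems.get(word);
    }

    /** Stage 1 (build): mutable collection of word-to-stem mappings. */
    static final class Builder {
        private final Map<String, String> pending = new HashMap<>();

        Builder put(String word, String stem) {
            pending.put(word, stem);
            return this;
        }

        /** Stage 2 (compile): freeze the collected mappings into an immutable table. */
        ToyStemTable build() {
            return new ToyStemTable(pending);
        }
    }
}
```

Once `build()` has returned, further `put` calls on the builder affect only the builder. The table that was already built keeps its contents.
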
## Building a trie programmatically
You can construct a trie directly without using the CLI.
```java
import org.egothor.stemmer.*;

public final class BuildExample {
    public static void main(String[] args) {
        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        // The builder is the mutable stage of the lifecycle.
        FrequencyTrie.Builder<String> builder =
                new FrequencyTrie.Builder<>(String[]::new, settings);
        PatchCommandEncoder encoder = new PatchCommandEncoder();

        // Each stored value is the patch command that turns the word into its stem.
        builder.put("running", encoder.encode("running", "run"));
        builder.put("runs", encoder.encode("runs", "run"));
        builder.put("ran", encoder.encode("ran", "run"));

        // build() produces the immutable, compiled trie.
        FrequencyTrie<String> trie = builder.build();
    }
}
```
## Loading from dictionary files
To parse dictionary files directly:
```java
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;

public final class LoadFromDictionaryExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"),
                true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );
    }
}
```
## Loading a compiled binary trie
```java
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;

public final class LoadBinaryExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie =
                StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
    }
}
```
Loading a precompiled binary is the **preferred production approach**, because the trie is not rebuilt from dictionary data at runtime.
## Querying for stems
### Preferred result
```java
String word = "running";
String patch = trie.get(word);
String stem = PatchCommandEncoder.apply(word, patch);
```
### All candidates
```java
String[] patches = trie.getAll(word);
for (String patch : patches) {
    String stem = PatchCommandEncoder.apply(word, patch);
}
```
## Accessing value frequencies
For diagnostic or advanced use cases:
```java
import java.util.List;
import org.egothor.stemmer.ValueCount;

List<ValueCount<String>> entries = trie.getEntries("axes");
for (ValueCount<String> entry : entries) {
    String patch = entry.value();
    int count = entry.count();
}
```
This allows:
* inspecting ambiguity
* understanding ranking decisions
* debugging dictionary quality
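
To make the idea of a ranking decision concrete, the sketch below orders candidate values by descending count and then breaks ties. The tie-break order used here (shorter value first, then lexicographic order) is an assumption chosen for illustration, and `RankingSketch` is a hypothetical helper rather than part of Radixor. Consult the library itself for the authoritative ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that ranks values by count with deterministic tie-breakers. */
final class RankingSketch {
    private RankingSketch() {
    }

    /** Returns the keys of {@code counts}, best candidate first. */
    static List<String> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            // Higher count wins.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            if (byCount != 0) {
                return byCount;
            }
            // Assumed tie-breaker 1: the shorter value wins.
            int byLength = Integer.compare(a.getKey().length(), b.getKey().length());
            if (byLength != 0) {
                return byLength;
            }
            // Assumed tie-breaker 2: lexicographic order.
            return a.getKey().compareTo(b.getKey());
        });

        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }
}
```

Because every comparison ends in a total order on the value itself, the result does not depend on the iteration order of the input map.
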
## Using bundled language resources
```java
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
        StemmerPatchTrieLoader.Language.US_UK_PROFI,
        true,
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
```
Bundled dictionaries are useful for:
* quick integration
* testing
* reference behavior
## Persisting a compiled trie
```java
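import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;

public final class SaveExample {
    public static void main(String[] args) throws IOException {
        // Obtain a compiled trie first; here it is built from a dictionary file.
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"), true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));
        // Write the compiled trie to a binary file.
        StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
    }
}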
```
## Modifying an existing trie
A compiled trie can be reopened into a builder, extended, and rebuilt.
```java
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;

public final class ModifyExample {
    public static void main(String[] args) throws IOException {
        // Load the existing compiled trie.
        FrequencyTrie<String> compiled =
                StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));

        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        // Reopen it as a mutable builder and add a word that should stay unchanged.
        FrequencyTrie.Builder<String> builder =
                FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);
        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

        // Rebuild and persist the result under a new name.
        FrequencyTrie<String> updated = builder.build();
        StemmerPatchTrieBinaryIO.write(updated,
                Path.of("english-custom.radixor.gz"));
    }
}
```
## Thread safety
* `FrequencyTrie` (compiled):
  * **thread-safe**
  * safe for concurrent reads
* `FrequencyTrie.Builder`:
  * **not thread-safe**
  * intended for single-threaded construction
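
As a rough illustration of concurrent reads, the sketch below lets several worker threads query one shared read-only structure. To stay self-contained it uses an immutable `Map<String, String>` as a stand-in for a compiled trie; `SharedLookupSketch` and `stemConcurrently` are hypothetical names, and the pool size of four is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: many threads read one immutable lookup structure. */
final class SharedLookupSketch {
    private SharedLookupSketch() {
    }

    /** Looks up every word on a worker thread; unknown words are returned unchanged. */
    static List<String> stemConcurrently(Map<String, String> compiled, List<String> words) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String word : words) {
                // Every task reads the same shared instance; nothing is copied per thread.
                Callable<String> task = () -> compiled.getOrDefault(word, word);
                futures.add(pool.submit(task));
            }
            List<String> stems = new ArrayList<>();
            for (Future<String> future : futures) {
                stems.add(future.get()); // results are collected in submission order
            }
            return stems;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lookups", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

No locking is needed on the read path because the shared structure never changes after construction. A builder, by contrast, should stay confined to the thread that fills it.
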
## Performance characteristics
### Querying
* O(length of word)
* minimal allocations
* suitable for high-throughput pipelines
### Loading
* binary loading is fast
* no preprocessing required
### Building
* depends on dictionary size
* reduction phase may be CPU-intensive
## Best practices
### Reuse compiled trie instances
* load once
* share across threads
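
One common way to get "load once, share everywhere" in Java is the initialization-on-demand holder idiom, sketched below. The `StemmerHolder` class and its `load()` placeholder are hypothetical; in a real application `load()` would read a compiled trie, whereas here it returns a small immutable map so that the example runs on its own.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder that loads a shared lookup structure exactly once. */
final class StemmerHolder {
    /** Counts how often load() ran; present only to make the behavior observable. */
    static final AtomicInteger LOADS = new AtomicInteger();

    private StemmerHolder() {
    }

    /** Returns the single shared instance, loading it on first use. */
    static Map<String, String> instance() {
        return Lazy.INSTANCE;
    }

    // The JVM initializes this nested class once, on first access, in a thread-safe way.
    private static final class Lazy {
        static final Map<String, String> INSTANCE = load();
    }

    // Placeholder for the real, expensive loading step.
    private static Map<String, String> load() {
        LOADS.incrementAndGet();
        return Map.of("running", "run", "runs", "run");
    }
}
```

Every caller of `StemmerHolder.instance()` receives the same object, and the loading cost is paid on the first call only.
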
### Prefer binary loading in production
* avoid rebuilding at runtime
* treat compiled files as deployable artifacts
### Use `getAll()` only when needed
* `get()` is faster and sufficient for most use cases
### Keep builders short-lived
* build → compile → discard
## Integration patterns
### Search systems
* apply stemming during indexing and querying
* ensure consistent dictionary usage
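
The need for consistency can be seen in a minimal inverted index. In the sketch below, indexing and querying pass tokens through the same stemming function, so both sides agree on the index key. `StemmedIndexSketch` is a hypothetical class, and the stemmer is injected as a plain `UnaryOperator<String>` so that the example does not depend on any particular library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory index that stems tokens at index time and at query time. */
final class StemmedIndexSketch {
    private final UnaryOperator<String> stemmer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    StemmedIndexSketch(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Indexing side: every whitespace-separated token is lowercased and stemmed before storage. */
    void add(int documentId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(stemmer.apply(token), key -> new TreeSet<>())
                        .add(documentId);
            }
        }
    }

    /** Query side: the term goes through the same lowercasing and the same stemmer. */
    Set<Integer> search(String term) {
        String key = stemmer.apply(term.toLowerCase(Locale.ROOT));
        return postings.getOrDefault(key, Set.of());
    }
}
```

If the query side used a different dictionary than the indexing side, a term such as "running" could be reduced to a key that was never stored, and matching documents would be missed.
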
### Text normalization pipelines
* integrate as a transformation step
* combine with tokenization and filtering
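
A normalization pipeline of this kind might look like the following sketch: tokenize, lowercase, drop stop words, then stem. All names here are hypothetical, the tokenizer is a deliberately simple split on non-letter characters, and the stemmer is passed in as a function so that any implementation can be plugged in.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical pipeline: tokenize, filter stop words, then stem each remaining token. */
final class NormalizationPipelineSketch {
    // Any run of characters that are not Unicode letters separates two tokens.
    private static final Pattern NON_LETTERS = Pattern.compile("[^\\p{L}]+");

    private NormalizationPipelineSketch() {
    }

    static List<String> normalize(String text, Set<String> stopWords,
            UnaryOperator<String> stemmer) {
        return NON_LETTERS.splitAsStream(text.toLowerCase(Locale.ROOT))
                .filter(token -> !token.isEmpty())           // a leading separator yields ""
                .filter(token -> !stopWords.contains(token)) // filtering step
                .map(stemmer)                                // stemming as one transformation step
                .collect(Collectors.toList());
    }
}
```

Stop-word filtering is placed before stemming in this sketch so that the stop list can be written with surface forms. Whether that order suits a given application is a design choice, not a requirement.
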
### Domain adaptation
* extend dictionaries with domain-specific vocabulary
* rebuild compiled artifacts
## Next steps
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Architecture and reduction](architecture-and-reduction.md)
## Summary
Programmatic usage of Radixor follows a clear pattern:
* build or load a trie
* query using patch commands
* apply transformations
The API is intentionally simple at the surface, while providing deeper control when needed for:
* ambiguity handling
* diagnostics
* dictionary evolution