Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words
fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs
refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure
test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics
docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance
chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening

322
docs/programmatic-usage.md
Normal file
@@ -0,0 +1,322 @@
# Programmatic Usage

> ← Back to [README.md](../README.md)

This document describes how to use **Radixor** programmatically from Java.

It covers:

- building a trie from dictionary data
- compiling it into an immutable structure
- loading compiled stemmers
- querying for stems
- working with multiple candidates
- modifying existing compiled stemmers

## Overview

Radixor separates the stemming lifecycle into three stages:

1. **Build** – collect word–stem mappings in a mutable structure
2. **Compile** – reduce and convert to an immutable trie
3. **Query** – perform fast runtime lookups

These stages are represented by:

- `FrequencyTrie.Builder` (mutable)
- `FrequencyTrie` (immutable, compiled)
- `StemmerPatchTrieLoader` / `StemmerPatchTrieBinaryIO` (I/O)

## Building a trie programmatically

You can construct a trie directly without using the CLI.

```java
import org.egothor.stemmer.*;

public final class BuildExample {

    public static void main(String[] args) {
        // Select the reduction mode that is applied when the trie is compiled.
        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        FrequencyTrie.Builder<String> builder =
                new FrequencyTrie.Builder<>(String[]::new, settings);

        PatchCommandEncoder encoder = new PatchCommandEncoder();

        // Each word form is stored with the patch that turns it into its stem.
        builder.put("running", encoder.encode("running", "run"));
        builder.put("runs", encoder.encode("runs", "run"));
        builder.put("ran", encoder.encode("ran", "run"));

        // Compile the mutable builder into an immutable trie.
        FrequencyTrie<String> trie = builder.build();
    }
}
```

## Loading from dictionary files

To parse dictionary files directly:

```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class LoadFromDictionaryExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"),
                true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );
    }
}
```

## Loading a compiled binary trie

```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class LoadBinaryExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie =
                StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
    }
}
```

Loading a precompiled binary is the **preferred production approach**: it avoids parsing dictionaries and rebuilding the trie at runtime.

## Querying for stems

### Preferred result

`get()` returns the preferred patch for a word; applying that patch to the word yields the stem.

```java
String word = "running";
String patch = trie.get(word);
String stem = PatchCommandEncoder.apply(word, patch);
```

### All candidates

`getAll()` returns every candidate patch stored for the word.

```java
String[] patches = trie.getAll(word);

for (String patch : patches) {
    String stem = PatchCommandEncoder.apply(word, patch);
    // use the candidate stem here
}
```

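To make the lookup-then-apply pattern above more concrete, here is a minimal, self-contained sketch of a patch command. The `ToyPatch` class and its `<chars to drop>:<suffix to append>` format are hypothetical and chosen only for illustration; they are not Radixor's actual patch encoding, which is produced and interpreted by `PatchCommandEncoder`.

```java
/**
 * Toy patch commands of the form "<charsToDrop>:<suffixToAppend>".
 * Hypothetical illustration only; this is NOT Radixor's patch format.
 */
final class ToyPatch {

    private ToyPatch() {
    }

    /** Encodes the edit that turns source into target. */
    static String encode(String source, String target) {
        // Find the length of the shared prefix.
        int common = 0;
        int limit = Math.min(source.length(), target.length());
        while (common < limit && source.charAt(common) == target.charAt(common)) {
            common++;
        }
        // Everything after the shared prefix is dropped and replaced.
        int charsToDrop = source.length() - common;
        return charsToDrop + ":" + target.substring(common);
    }

    /** Applies a patch produced by encode to a word. */
    static String apply(String word, String patch) {
        int separator = patch.indexOf(':');
        int charsToDrop = Integer.parseInt(patch.substring(0, separator));
        return word.substring(0, word.length() - charsToDrop)
                + patch.substring(separator + 1);
    }
}
```

In this toy format, a patch describes an edit rather than a result, so the patch computed for `running` also reduces `gunning` to `gun`, and a word that is already its own stem encodes to a patch that changes nothing.
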
## Accessing value frequencies

For diagnostic or advanced use cases:

```java
import org.egothor.stemmer.ValueCount;

java.util.List<ValueCount<String>> entries = trie.getEntries("axes");

for (ValueCount<String> entry : entries) {
    String patch = entry.value();
    int count = entry.count();
}
```

This allows:

* inspecting ambiguity
* understanding ranking decisions
* debugging dictionary quality

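As a rough mental model for the ranking decisions mentioned above, the sketch below orders candidates by frequency and breaks ties by length and then lexical order. The `ToyRanking` class and its `Candidate` record are hypothetical stand-ins, and the tie-break directions (shorter first, then ascending lexical order) are assumptions made for this example; the library's own ordering may differ in detail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy ranking of candidate values; hypothetical, not Radixor's implementation. */
final class ToyRanking {

    /** A value with its observed frequency (stand-in for a trie entry). */
    record Candidate(String value, int count) {
    }

    /** Higher count first; ties go to the shorter value, then lexical order (assumed). */
    static final Comparator<Candidate> ORDER =
            Comparator.comparingInt(Candidate::count).reversed()
                    .thenComparingInt(c -> c.value().length())
                    .thenComparing(Candidate::value);

    private ToyRanking() {
    }

    /** Returns the candidate values in ranked order without modifying the input. */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(ORDER);
        List<String> values = new ArrayList<>();
        for (Candidate candidate : sorted) {
            values.add(candidate.value());
        }
        return values;
    }
}
```

Because every comparison step is total, equal counts never leave the order to chance, which is what makes the preferred result reproducible across builds.
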
## Using bundled language resources

```java
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
        StemmerPatchTrieLoader.Language.US_UK_PROFI,
        true,
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
```

Bundled dictionaries are useful for:

* quick integration
* testing
* reference behavior

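## Persisting a compiled trie

```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class SaveExample {

    public static void main(String[] args) throws IOException {
        // Obtain a compiled trie first; here it is parsed from a dictionary file,
        // as in the dictionary loading example earlier in this document.
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"),
                true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );

        StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
    }
}
```
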
## Modifying an existing trie

A compiled trie can be reopened into a builder, extended, and rebuilt.

```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class ModifyExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> compiled =
                StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));

        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        // Copy the compiled trie into a new mutable builder.
        FrequencyTrie.Builder<String> builder =
                FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);

        // Add a domain-specific word that should be left unchanged.
        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

        FrequencyTrie<String> updated = builder.build();

        StemmerPatchTrieBinaryIO.write(updated,
                Path.of("english-custom.radixor.gz"));
    }
}
```

## Thread safety

* `FrequencyTrie` (compiled):

  * **thread-safe**
  * safe for concurrent reads

* `FrequencyTrie.Builder`:

  * **not thread-safe**
  * intended for single-threaded construction

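Since the compiled trie is safe for concurrent reads, an application typically needs only one instance. The following is a minimal sketch of a thread-safe load-once holder; `LoadOnce` is a hypothetical helper written for this example, not part of Radixor, and the supplier passed to it stands in for whatever loading call the application uses.

```java
import java.util.function.Supplier;

/**
 * Thread-safe, lazily initialized holder (hypothetical helper, not a Radixor type).
 * The loader runs at most once, provided it returns a non-null value.
 */
final class LoadOnce<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value;

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T local = value;
        if (local == null) {
            // Double-checked locking: only the first caller performs the load.
            synchronized (this) {
                local = value;
                if (local == null) {
                    local = loader.get();
                    value = local;
                }
            }
        }
        return local;
    }
}
```

A holder like this can wrap the binary loading call so that worker threads share the same immutable instance, while builders stay confined to the thread that constructs them.
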
## Performance characteristics

### Querying

* O(length of word)
* minimal allocations
* suitable for high-throughput pipelines

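The O(length of word) bound follows from how a character trie is traversed: each character of the key costs one child lookup. The sketch below illustrates this with a toy trie that walks the word from its last character, loosely mirroring the reverse-word storage the project describes. `ToyReversedTrie` is hypothetical and much simpler than the compiled `FrequencyTrie`; it stores a single value per key and performs exact-match lookups only.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy character trie keyed by reversed words; hypothetical, not Radixor's structure. */
final class ToyReversedTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value;
    }

    private final Node root = new Node();

    /** Stores a value under the word, descending from the last character to the first. */
    void put(String word, String value) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    /** Returns the stored value or null if the word is unknown; one lookup per character. */
    String get(String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0 && node != null; i--) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? null : node.value;
    }
}
```

Walking from the end of the word means that forms sharing a suffix, such as `running` and `gunning`, share the first steps of the traversal.
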
### Loading

* binary loading is fast
* no dictionary parsing or reduction is required at load time

### Building

* depends on dictionary size
* reduction phase may be CPU-intensive

## Best practices

### Reuse compiled trie instances

* load once
* share across threads

### Prefer binary loading in production

* avoid rebuilding at runtime
* treat compiled files as deployable artifacts

### Use `getAll()` only when needed

* `get()` is faster and sufficient for most use cases

### Keep builders short-lived

* build → compile → discard

## Integration patterns

### Search systems

* apply stemming during indexing and querying
* ensure consistent dictionary usage

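The consistency requirement for search systems can be sketched with a toy inverted index that routes both indexed text and query terms through the same normalization function. `ToyIndex` is hypothetical, and the `UnaryOperator<String>` passed to it is a stand-in for a trie lookup followed by patch application.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

/** Toy inverted index that normalizes terms identically at index and query time. */
final class ToyIndex {

    // Stand-in for "look up the patch, then apply it"; supplied by the caller.
    private final UnaryOperator<String> stemmer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    ToyIndex(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    private String normalize(String token) {
        return stemmer.apply(token.toLowerCase(Locale.ROOT));
    }

    /** Indexes whitespace-separated tokens of the text under the given document id. */
    void add(int documentId, String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(normalize(token), k -> new TreeSet<>())
                        .add(documentId);
            }
        }
    }

    /** Returns the ids of documents containing the normalized term. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(normalize(term), Set.of());
    }
}
```

If the index were built with one dictionary and queried with another, the two sides could normalize the same word to different stems and matching documents would be missed.
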
### Text normalization pipelines

* integrate as a transformation step
* combine with tokenization and filtering

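A normalization pipeline of this kind might look like the sketch below, where stemming is one step after tokenization and stop-word filtering. `ToyNormalizer`, its letter-based tokenization rule, and the stop-word set are assumptions made for illustration; the stemming step is again passed in as a plain function.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Toy text normalizer: tokenize, lower-case, drop stop words, then stem. */
final class ToyNormalizer {

    // Any run of non-letter characters separates tokens (assumed tokenization rule).
    private static final Pattern NON_LETTERS = Pattern.compile("[^\\p{L}]+");

    private ToyNormalizer() {
    }

    static List<String> normalize(String text,
                                  Set<String> stopWords,
                                  UnaryOperator<String> stemmer) {
        return NON_LETTERS.splitAsStream(text)
                .filter(token -> !token.isEmpty())
                .map(token -> token.toLowerCase(Locale.ROOT))
                .filter(token -> !stopWords.contains(token))
                .map(stemmer)
                .collect(Collectors.toList());
    }
}
```

Placing the stemming step last keeps the earlier stages independent of the dictionary, so the stemmer can be swapped without touching tokenization or filtering.
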
### Domain adaptation

* extend dictionaries with domain-specific vocabulary
* rebuild compiled artifacts

## Next steps

* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Architecture and reduction](architecture-and-reduction.md)

## Summary

Programmatic usage of Radixor follows a clear pattern:

* build or load a trie
* query using patch commands
* apply transformations

The API is intentionally simple at the surface, while providing deeper control when needed for:

* ambiguity handling
* diagnostics
* dictionary evolution