# Built-in Languages
Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom data.
These built-in resources are useful for:
- quick integration
- testing and evaluation
- reference behavior
- prototyping search pipelines
## Overview
Bundled dictionaries are exposed through the `StemmerPatchTrieLoader.Language` enum.
They are packaged with the library and loaded from the classpath.
## Supported languages
The following language identifiers are currently available:

| Language | Enum constant | Description |
|---|---|---|
| Danish | `DA_DK` | Danish |
| German | `DE_DE` | German |
| Spanish | `ES_ES` | Spanish |
| French | `FR_FR` | French |
| Italian | `IT_IT` | Italian |
| Dutch | `NL_NL` | Dutch |
| Norwegian | `NO_NO` | Norwegian |
| Portuguese | `PT_PT` | Portuguese |
| Russian | `RU_RU` | Russian |
| Swedish | `SV_SE` | Swedish |
| English | `US_UK` | Standard English |
| English | `US_UK_PROFI` | Extended English dictionary |

## Basic usage
Load a bundled stemmer:
```java
import java.io.IOException;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BuiltInExample {
    public static void main(String[] args) throws IOException {
        // Load the extended English dictionary bundled with the library.
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
            StemmerPatchTrieLoader.Language.US_UK_PROFI,
            true,
            ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );
    }
}
```
## Example: stemming with `US_UK_PROFI`
```java
import java.io.IOException;
import org.egothor.stemmer.*;

public final class EnglishExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
            StemmerPatchTrieLoader.Language.US_UK_PROFI,
            true,
            ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        String word = "running";
        // Look up the patch command stored for the word.
        String patch = trie.get(word);
        // Apply the patch to the original word to obtain the stem.
        String stem = PatchCommandEncoder.apply(word, patch);
        System.out.println(word + " -> " + stem);
    }
}
```
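The example above works in two steps: look up a patch for the word, then apply that patch to the word to get the stem. The sketch below illustrates that flow only, using the standard library. Everything in it is hypothetical: `PatchFlowSketch`, `SuffixPatch`, and the "strip N trailing characters, then append a suffix" model are stand-ins, not Radixor's patch encoding, and the exact-match map does not model how the trie matches words.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a simplified stand-in for the lookup-then-apply flow. */
final class PatchFlowSketch {

    /** Hypothetical patch: drop `strip` trailing characters, then append `suffix`. */
    static final class SuffixPatch {
        final int strip;
        final String suffix;

        SuffixPatch(int strip, String suffix) {
            this.strip = strip;
            this.suffix = suffix;
        }

        String applyTo(String word) {
            return word.substring(0, word.length() - strip) + suffix;
        }
    }

    /** A patch that leaves the word unchanged. */
    static final SuffixPatch NOOP = new SuffixPatch(0, "");

    /** Exact-match table standing in for the trie lookup (assumption of this sketch). */
    private static final Map<String, SuffixPatch> PATCHES = new HashMap<>();

    static {
        PATCHES.put("running", new SuffixPatch(4, ""));
        PATCHES.put("studies", new SuffixPatch(3, "y"));
    }

    /** Step 1: find the patch for the word. Step 2: apply it to the same word. */
    static String stem(String word) {
        SuffixPatch patch = PATCHES.getOrDefault(word, NOOP);
        return patch.applyTo(word);
    }
}
```

Because the stored value is a transformation rather than the stem itself, the same patch can in principle serve many words that change in the same way. Words missing from this toy table fall back to the no-op patch and are returned unchanged.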
## `US_UK` vs `US_UK_PROFI`
### `US_UK`
- smaller dictionary
- faster load time
- suitable for lightweight use cases
### `US_UK_PROFI`
- larger and more complete dataset
- better coverage of word forms
- improved stemming quality
- slightly larger memory footprint
### Recommendation
Use `US_UK_PROFI` for most applications unless memory constraints are strict.
## How bundled dictionaries are loaded
Internally:
- dictionaries are stored as text resources
- parsed using `StemmerDictionaryParser`
- compiled into a trie at load time
This means:
- the first load pays the parsing and compilation cost
- lookups on the loaded trie are fast afterwards
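Since the parsing and compilation cost is paid at load time, an application may want to make sure each dictionary is loaded only once and then reused. The following is a minimal sketch of that idea using only the standard library. `LoadOnceCache` is a hypothetical helper, not part of Radixor, and it assumes the loaded value can safely be shared between callers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: caches one expensive-to-build value per key. */
final class LoadOnceCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LoadOnceCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first request for a key; later calls return the cached value. */
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }
}
```

In an application, the loader function would wrap the call to `StemmerPatchTrieLoader.load(...)` shown earlier, keyed by the `Language` constant. That call declares `IOException`, so the wrapper would need to rethrow it, for example as `UncheckedIOException`.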
## When to use bundled languages
Bundled dictionaries are suitable when:
- you need quick results without preparing custom data
- you are prototyping or experimenting
- your language requirements match the provided datasets
## When to use custom dictionaries
You should prefer custom dictionaries when:
- domain-specific vocabulary is important
- accuracy requirements are high
- you need full control over stemming behavior
Typical examples:
- technical terminology
- product catalogs
- biomedical text
- legal or financial language
## Production recommendation
For production systems:
1. Load a bundled dictionary
2. Extend it with domain-specific terms (optional)
3. Compile it into a binary `.radixor.gz` file
4. Deploy the compiled artifact
5. Load it using `loadBinary(...)`
This avoids:
- runtime parsing overhead
- repeated compilation
- startup latency
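To show why a precompiled, GZip-compressed artifact needs no text parsing at startup, here is a minimal round-trip sketch using only `java.util.zip` and `java.io`. The payload layout (an entry count followed by strings) and the class name `GzipArtifactSketch` are assumptions made for illustration. This is not the `.radixor.gz` format and not Radixor's persistence API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative only: writes and reads a GZip-compressed binary payload. */
final class GzipArtifactSketch {

    /** Build step: serialize the entries in binary form and compress them. */
    static byte[] write(String[] entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(entries.length);
            for (String entry : entries) {
                out.writeUTF(entry);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Startup step: decompress and read the binary data back, with no text parsing. */
    static String[] read(byte[] artifact) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(artifact)))) {
            String[] entries = new String[in.readInt()];
            for (int i = 0; i < entries.length; i++) {
                entries[i] = in.readUTF();
            }
            return entries;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The build step runs once, for example in a release pipeline, while the read step is all that remains at application startup.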
## Example workflow
```java
// 1. Load bundled dictionary
FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
    StemmerPatchTrieLoader.Language.US_UK_PROFI,
    true,
    ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);

// 2. Modify (optional): copy the compiled trie into a writable builder
FrequencyTrie.Builder<String> builder =
    FrequencyTrieBuilders.copyOf(
        base,
        String[]::new,
        ReductionSettings.withDefaults(
            ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        )
    );
// Keep "microservices" unchanged by mapping it to the no-op patch.
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

// 3. Compile
FrequencyTrie<String> compiled = builder.build();

// 4. Save
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
```
## Limitations
* bundled dictionaries are **general-purpose**
* they may not reflect:
  * domain-specific usage
  * rare or specialized vocabulary
  * organization-specific terminology
## Next steps
* [Quick start](quick-start.md)
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Programmatic usage](programmatic-usage.md)
## Summary
Radixor’s built-in language support provides:
* immediate usability
* reference datasets
* a starting point for customization
For production systems, the bundled dictionaries are best used as:
* a baseline
* a seed for further extension
* a source for compiled deployment artifacts