Radixor/docs/built-in-languages.md
Leo Galambos 038514bad0 Refine stemmer core, compiled trie workflow, tests, and public documentation
2026-04-13 02:10:46 +02:00

# Built-in Languages
> ← Back to [README.md](../README.md)
Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
These built-in resources are useful for:
- quick integration
- testing and evaluation
- reference behavior
- prototyping search pipelines
## Overview
Bundled dictionaries are exposed through:
```java
StemmerPatchTrieLoader.Language
```
They are packaged with the library and loaded from the classpath.
## Supported languages
The following language identifiers are currently available:
| Language   | Enum constant | Description                 |
|------------|---------------|-----------------------------|
| Danish     | `DA_DK`       | Danish                      |
| German     | `DE_DE`       | German                      |
| Spanish    | `ES_ES`       | Spanish                     |
| French     | `FR_FR`       | French                      |
| Italian    | `IT_IT`       | Italian                     |
| Dutch      | `NL_NL`       | Dutch                       |
| Norwegian  | `NO_NO`       | Norwegian                   |
| Portuguese | `PT_PT`       | Portuguese                  |
| Russian    | `RU_RU`       | Russian                     |
| Swedish    | `SV_SE`       | Swedish                     |
| English    | `US_UK`       | Standard English            |
| English    | `US_UK_PROFI` | Extended English dictionary |
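
When the language comes from configuration or a request locale, the enum constant has to be resolved from a string. The following is a minimal sketch, assuming the constant names follow the `LANGUAGE_REGION` pattern shown in the table. `LanguageLookup`, `BundledLanguage`, and `fromLocaleTag` are hypothetical names. `BundledLanguage` is a local stand-in that mirrors the table so the example compiles without Radixor on the classpath; real code would resolve against `StemmerPatchTrieLoader.Language` instead. Mapping every English tag to `US_UK` is a choice made for this sketch, not documented library behavior.

```java
import java.util.Locale;
import java.util.Optional;

final class LanguageLookup {

    /** Stand-in for StemmerPatchTrieLoader.Language; mirrors the table above. */
    enum BundledLanguage {
        DA_DK, DE_DE, ES_ES, FR_FR, IT_IT, NL_NL,
        NO_NO, PT_PT, RU_RU, SV_SE, US_UK, US_UK_PROFI
    }

    /**
     * Maps a BCP 47 tag such as "de-DE" to a constant such as DE_DE.
     * English has no LANGUAGE_REGION constant, so this sketch maps any
     * "en" tag to US_UK (an assumption, adjust as needed).
     */
    static Optional<BundledLanguage> fromLocaleTag(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        String language = locale.getLanguage();
        if (language.isEmpty()) {
            return Optional.empty(); // malformed or undetermined tag
        }
        if ("en".equals(language)) {
            return Optional.of(BundledLanguage.US_UK);
        }
        String name = (language + "_" + locale.getCountry()).toUpperCase(Locale.ROOT);
        try {
            return Optional.of(BundledLanguage.valueOf(name));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // no bundled dictionary for this locale
        }
    }

    public static void main(String[] args) {
        System.out.println(fromLocaleTag("de-DE"));
        System.out.println(fromLocaleTag("en-GB"));
        System.out.println(fromLocaleTag("pl-PL"));
    }
}
```

Returning an `Optional` keeps the "no bundled dictionary" case explicit, so the caller can decide whether to fall back to a custom dictionary or skip stemming.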
## Basic usage
Load a bundled stemmer:
```java
import java.io.IOException;

import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BuiltInExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );
    }
}
```
## Example: stemming with `US_UK_PROFI`
```java
import java.io.IOException;

import org.egothor.stemmer.*;

public final class EnglishExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        String word = "running";

        // Look up the patch command stored for this word
        String patch = trie.get(word);

        // Apply the patch to the original word to obtain the stem
        String stem = PatchCommandEncoder.apply(word, patch);

        System.out.println(word + " -> " + stem);
    }
}
```
## `US_UK` vs `US_UK_PROFI`
### `US_UK`
* smaller dictionary
* faster load time
* suitable for lightweight use cases
### `US_UK_PROFI`
* larger and more complete dataset
* better coverage of word forms
* improved stemming quality
* slightly larger memory footprint
### Recommendation
Use `US_UK_PROFI` for most applications unless memory constraints are strict.
## How bundled dictionaries are loaded
Internally:
- dictionaries are stored as text resources
- parsed using `StemmerDictionaryParser`
- compiled into a trie at load time
This means:
- the first load pays the cost of parsing and compilation
- lookups on the loaded trie are fast afterwards
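
Because parsing and compilation happen at load time, an application that needs the same dictionary in several places may want to load it once and share the result. The following is a minimal sketch of that idea using only the JDK. `DictionaryCache` is a hypothetical helper, not part of Radixor, and the loader function stands in for a call to `StemmerPatchTrieLoader.load(...)`. It assumes the loaded trie can safely be shared between callers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Loads each dictionary at most once per key and reuses the result. */
final class DictionaryCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    DictionaryCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Runs the loader on the first request for a key; later requests reuse the value. */
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();

        // Stand-in loader: real code would call StemmerPatchTrieLoader.load(...) here
        DictionaryCache<String, String> tries = new DictionaryCache<>(language -> {
            loads.incrementAndGet();
            return "trie for " + language;
        });

        tries.get("US_UK_PROFI");
        tries.get("US_UK_PROFI");

        System.out.println("loads: " + loads.get()); // prints "loads: 1"
    }
}
```

Since `load(...)` in the examples on this page declares `IOException`, a real loader lambda would need to catch it and rethrow it as an unchecked exception such as `UncheckedIOException`.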
## When to use bundled languages
Bundled dictionaries are suitable when:
- you need quick results without preparing custom data
- you are prototyping or experimenting
- your language requirements match the provided datasets
## When to use custom dictionaries
You should prefer custom dictionaries when:
- domain-specific vocabulary is important
- accuracy requirements are high
- you need full control over stemming behavior
Typical examples:
- technical terminology
- product catalogs
- biomedical text
- legal or financial language
## Production recommendation
For production systems:
1. Load a bundled dictionary
2. Extend it with domain-specific terms (optional)
3. Compile it into a binary `.radixor.gz` file
4. Deploy the compiled artifact
5. Load it using `loadBinary(...)`
This avoids:
- runtime parsing overhead
- repeated compilation
- startup latency
## Example workflow
```java
// 1. Load bundled dictionary
FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
        StemmerPatchTrieLoader.Language.US_UK_PROFI,
        true,
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);

// 2. Modify (optional)
FrequencyTrie.Builder<String> builder =
        FrequencyTrieBuilders.copyOf(
                base,
                String[]::new,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

// 3. Compile
FrequencyTrie<String> compiled = builder.build();

// 4. Save
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
```
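
The workflow above stops at writing the artifact. At startup, an application could prefer the compiled file and fall back to the bundled dictionary when the file is missing. The following is a minimal sketch of that decision, using only the JDK. `StartupLoading`, `IoSupplier`, and `loadPreferringCompiled` are hypothetical names. The signature of `loadBinary(...)` is not shown on this page, so both loaders are passed in as parameters rather than called directly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class StartupLoading {

    /** A loading step that may fail with an I/O error. */
    @FunctionalInterface
    interface IoSupplier<T> {
        T get() throws IOException;
    }

    /**
     * Uses the compiled artifact when it exists as a readable regular file,
     * otherwise falls back to the bundled dictionary.
     */
    static <T> T loadPreferringCompiled(
            Path artifact,
            IoSupplier<T> compiledLoader,
            IoSupplier<T> bundledLoader) throws IOException {
        if (Files.isRegularFile(artifact) && Files.isReadable(artifact)) {
            return compiledLoader.get();
        }
        return bundledLoader.get();
    }

    public static void main(String[] args) throws IOException {
        Path artifact = Path.of("english-custom.radixor.gz");

        // Stand-in loaders: real code would call loadBinary(...) for the first
        // and StemmerPatchTrieLoader.load(...) for the second
        String source = loadPreferringCompiled(
                artifact,
                () -> "compiled artifact",
                () -> "bundled dictionary");

        System.out.println("loaded from: " + source);
    }
}
```

Keeping the fallback explicit means a missing artifact slows startup instead of failing it. In a strict production setup you may prefer to fail fast so that the missing file is noticed.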
## Limitations
* bundled dictionaries are **general-purpose**
* they may not reflect:
  * domain-specific usage
  * rare or specialized vocabulary
  * organization-specific terminology
## Next steps
* [Quick start](quick-start.md)
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Programmatic usage](programmatic-usage.md)
## Summary
Radixor's built-in language support provides:
* immediate usability
* reference datasets
* a starting point for customization
For production systems, the bundled dictionaries are best used as:
* a baseline
* a seed for further extension
* a source for compiled deployment artifacts