Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words

fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs

refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure

test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics

docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance

chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00
parent 15248c92c9
commit 038514bad0
64 changed files with 190190 additions and 20 deletions

docs/built-in-languages.md Normal file

@@ -0,0 +1,252 @@
# Built-in Languages
> ← Back to [README.md](../README.md)

Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
These built-in resources are useful for:
- quick integration
- testing and evaluation
- reference behavior
- prototyping search pipelines
## Overview
Bundled dictionaries are exposed through:
```java
StemmerPatchTrieLoader.Language
```
They are packaged with the library and loaded from the classpath.
## Supported languages
The following language identifiers are currently available:
| Language   | Enum constant | Description                 |
|------------|---------------|-----------------------------|
| Danish     | `DA_DK`       | Danish                      |
| German     | `DE_DE`       | German                      |
| Spanish    | `ES_ES`       | Spanish                     |
| French     | `FR_FR`       | French                      |
| Italian    | `IT_IT`       | Italian                     |
| Dutch      | `NL_NL`       | Dutch                       |
| Norwegian  | `NO_NO`       | Norwegian                   |
| Portuguese | `PT_PT`       | Portuguese                  |
| Russian    | `RU_RU`       | Russian                     |
| Swedish    | `SV_SE`       | Swedish                     |
| English    | `US_UK`       | Standard English            |
| English    | `US_UK_PROFI` | Extended English dictionary |
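
When the language arrives as a locale rather than as a hard-coded constant, a small lookup can select the bundled dictionary. The sketch below is only an illustration: `BundledLanguage` is a local stand-in that mirrors the constant names in the table so the code compiles without Radixor on the classpath, `BundledLanguageResolver` is a hypothetical helper, and mapping English to `US_UK_PROFI` and Norwegian to the `no` code are assumptions rather than library behavior.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Stand-in for StemmerPatchTrieLoader.Language (assumption: same constant names). */
enum BundledLanguage {
    DA_DK, DE_DE, ES_ES, FR_FR, IT_IT, NL_NL, NO_NO, PT_PT, RU_RU, SV_SE, US_UK, US_UK_PROFI
}

/** Hypothetical helper that maps a locale's language code to a bundled dictionary. */
final class BundledLanguageResolver {

    // ISO 639 language code -> bundled dictionary; English maps to the extended set.
    private static final Map<String, BundledLanguage> BY_CODE = Map.ofEntries(
            Map.entry("da", BundledLanguage.DA_DK),
            Map.entry("de", BundledLanguage.DE_DE),
            Map.entry("es", BundledLanguage.ES_ES),
            Map.entry("fr", BundledLanguage.FR_FR),
            Map.entry("it", BundledLanguage.IT_IT),
            Map.entry("nl", BundledLanguage.NL_NL),
            Map.entry("no", BundledLanguage.NO_NO),
            Map.entry("pt", BundledLanguage.PT_PT),
            Map.entry("ru", BundledLanguage.RU_RU),
            Map.entry("sv", BundledLanguage.SV_SE),
            Map.entry("en", BundledLanguage.US_UK_PROFI));

    private BundledLanguageResolver() {
    }

    /** Returns the bundled dictionary for the locale, or empty if none is bundled. */
    static Optional<BundledLanguage> resolve(Locale locale) {
        return Optional.ofNullable(BY_CODE.get(locale.getLanguage()));
    }

    public static void main(String[] args) {
        System.out.println(resolve(Locale.GERMANY)); // prints "Optional[DE_DE]"
        System.out.println(resolve(Locale.JAPAN));   // prints "Optional.empty"
    }
}
```

Returning an `Optional` keeps the "no bundled dictionary" case explicit, which is where a custom dictionary would come in.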
## Basic usage
Load a bundled stemmer:
```java
import java.io.IOException;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BuiltInExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );
    }
}
```
## Example: stemming with `US_UK_PROFI`
```java
import java.io.IOException;
import org.egothor.stemmer.*;

public final class EnglishExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        String word = "running";
        // Look up the patch command stored for the word.
        String patch = trie.get(word);
        // Apply the patch to the original word to obtain the stem.
        String stem = PatchCommandEncoder.apply(word, patch);
        System.out.println(word + " -> " + stem);
    }
}
```
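
The lookup above returns a patch that is then applied to the word, rather than returning the stem directly. The following sketch illustrates that idea with a deliberately simplified patch of the form "drop N trailing characters, then append a suffix". It is not Radixor's patch command encoding: `SuffixPatch`, `between`, and `applyTo` are hypothetical names chosen for this illustration only.

```java
/**
 * Illustrative only: a simplified suffix patch, NOT Radixor's patch command encoding.
 * A patch means "drop N trailing characters, then append a suffix".
 */
final class SuffixPatch {

    /** Patch that leaves a word unchanged (the stem equals the word). */
    static final SuffixPatch NOOP = new SuffixPatch(0, "");

    final int dropCount;
    final String append;

    SuffixPatch(int dropCount, String append) {
        this.dropCount = dropCount;
        this.append = append;
    }

    /** Derives the patch that rewrites word into stem, based on their common prefix. */
    static SuffixPatch between(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return new SuffixPatch(word.length() - common, stem.substring(common));
    }

    /** Applies the patch; fails if the word is shorter than the number of characters to drop. */
    String applyTo(String word) {
        if (dropCount > word.length()) {
            throw new IllegalArgumentException("word too short for patch: " + word);
        }
        return word.substring(0, word.length() - dropCount) + append;
    }

    public static void main(String[] args) {
        SuffixPatch patch = SuffixPatch.between("running", "run"); // drop 4, append ""
        System.out.println("running -> " + patch.applyTo("running")); // prints "running -> run"
        System.out.println("winning -> " + patch.applyTo("winning")); // prints "winning -> win"
    }
}
```

In this simplified model, one patch value covers several words with the same ending, which hints at why storing patches instead of full stems can keep a dictionary compact.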
## `US_UK` vs `US_UK_PROFI`
### `US_UK`
* smaller dictionary
* faster load time
* suitable for lightweight use cases
### `US_UK_PROFI`
* larger and more complete dataset
* better coverage of word forms
* improved stemming quality
* slightly larger memory footprint
### Recommendation
Use:
```
US_UK_PROFI
```
for most applications unless memory constraints are strict.
## How bundled dictionaries are loaded
Internally:
- dictionaries are stored as text resources
- parsed using `StemmerDictionaryParser`
- compiled into a trie at load time
This means:
- the first load pays the full parsing and compilation cost
- lookups on the loaded trie do not repeat that work
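
Because loading a bundled dictionary includes parsing and compilation, it can make sense to load each language once and reuse the result. The sketch below shows one way to do that with a generic load-once cache. `LoadOnceCache` is a hypothetical helper, not part of Radixor; it assumes the loaded trie can safely be shared between callers, and in real use the loader function would wrap the `IOException` thrown by `StemmerPatchTrieLoader.load(...)`, for example in an `UncheckedIOException`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical helper: caches one loaded value per key so that an expensive
 * load runs at most once per key. K would be the language constant and V the
 * loaded trie; both are generic so the sketch does not depend on Radixor.
 */
final class LoadOnceCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LoadOnceCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only on the first request for a key. */
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }
}

final class LoadOnceCacheDemo {
    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // The lambda stands in for the real, expensive dictionary load.
        LoadOnceCache<String, String> cache = new LoadOnceCache<>(key -> {
            loads.incrementAndGet();
            return "trie-for-" + key;
        });

        cache.get("US_UK_PROFI");
        cache.get("US_UK_PROFI");
        System.out.println(loads.get()); // prints "1": the second call hit the cache
    }
}
```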
## When to use bundled languages
Bundled dictionaries are suitable when:
- you need quick results without preparing custom data
- you are prototyping or experimenting
- your language requirements match the provided datasets
## When to use custom dictionaries
You should prefer custom dictionaries when:
- domain-specific vocabulary is important
- accuracy requirements are high
- you need full control over stemming behavior
Typical examples:
- technical terminology
- product catalogs
- biomedical text
- legal or financial language
## Production recommendation
For production systems:
1. Load a bundled dictionary
2. Extend it with domain-specific terms (optional)
3. Compile it into a binary `.radixor.gz` file
4. Deploy the compiled artifact
5. Load it using `loadBinary(...)`
This avoids:
- runtime parsing overhead
- repeated compilation
- startup latency
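
One way to apply this recommendation at startup is to prefer a compiled artifact when it has been deployed and to fall back to the bundled dictionary otherwise. The sketch below is an assumption-laden illustration: `StartupLoading`, `IoLoader`, and `loadPreferCompiled` are hypothetical names, the two loaders are passed in as lambdas because the exact `loadBinary(...)` signature is not shown here, and plain strings stand in for tries in `main`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical startup helper; not part of Radixor. */
final class StartupLoading {

    /** A loader that may fail with an I/O error; T would be the trie type. */
    @FunctionalInterface
    interface IoLoader<T> {
        T load() throws IOException;
    }

    private StartupLoading() {
    }

    /**
     * Uses the compiled artifact when the file exists, otherwise falls back
     * to the bundled dictionary (which is parsed and compiled at load time).
     */
    static <T> T loadPreferCompiled(Path artifact, IoLoader<T> fromBinary, IoLoader<T> fromBundled)
            throws IOException {
        if (Files.isRegularFile(artifact)) {
            return fromBinary.load();
        }
        return fromBundled.load();
    }

    public static void main(String[] args) throws IOException {
        Path artifact = Path.of("english-custom.radixor.gz");
        // Strings stand in for the tries that the real loaders would return.
        String source = loadPreferCompiled(artifact, () -> "compiled", () -> "bundled");
        System.out.println("loaded from: " + source);
    }
}
```

Keeping the fallback explicit means a missing artifact degrades to slower startup instead of a failed one; whether that is desirable in production is a deployment decision.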
## Example workflow
```java
// 1. Load bundled dictionary
FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
        StemmerPatchTrieLoader.Language.US_UK_PROFI,
        true,
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);

// 2. Modify (optional)
FrequencyTrie.Builder<String> builder =
        FrequencyTrieBuilders.copyOf(
                base,
                String[]::new,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

// 3. Compile
FrequencyTrie<String> compiled = builder.build();

// 4. Save
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
```
## Limitations
* bundled dictionaries are **general-purpose**
* they may not reflect:
* domain-specific usage
* rare or specialized vocabulary
* organization-specific terminology
## Next steps
* [Quick start](quick-start.md)
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Programmatic usage](programmatic-usage.md)
## Summary
Radixor's built-in language support provides:
* immediate usability
* reference datasets
* a starting point for customization
For production systems, the bundled dictionaries are best used as:
* a baseline
* a seed for further extension
* a source for compiled deployment artifacts