Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words
fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs
refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure
test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics
docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance
chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00
parent 15248c92c9
commit 038514bad0
64 changed files with 190190 additions and 20 deletions
--- a/docs/built-in-languages.md
+++ b/docs/built-in-languages.md
@@ -0,0 +1,252 @@
+# Built-in Languages
+
+> ← Back to [README.md](../README.md)
+
+Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
+
+These built-in resources are useful for:
+
+- quick integration
+- testing and evaluation
+- reference behavior
+- prototyping search pipelines
+
+
+
+## Overview
+
+Bundled dictionaries are exposed through:
+
+```java
+StemmerPatchTrieLoader.Language
+```
+
+They are packaged with the library and loaded from the classpath.
+
+
+
+## Supported languages
+
+The following language identifiers are currently available:
+
+| Language   | Enum constant | Description                 |
+|------------|---------------|-----------------------------|
+| Danish     | `DA_DK`       | Danish                      |
+| German     | `DE_DE`       | German                      |
+| Spanish    | `ES_ES`       | Spanish                     |
+| French     | `FR_FR`       | French                      |
+| Italian    | `IT_IT`       | Italian                     |
+| Dutch      | `NL_NL`       | Dutch                       |
+| Norwegian  | `NO_NO`       | Norwegian                   |
+| Portuguese | `PT_PT`       | Portuguese                  |
+| Russian    | `RU_RU`       | Russian                     |
+| Swedish    | `SV_SE`       | Swedish                     |
+| English    | `US_UK`       | Standard English            |
+| English    | `US_UK_PROFI` | Extended English dictionary |
+
+
+
+## Basic usage
+
+Load a bundled stemmer:
+
+```java
+import java.io.IOException;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class BuiltInExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                true,
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+        );
+    }
+}
+```
+
+
+
+## Example: stemming with `US_UK_PROFI`
+
+```java
+import java.io.IOException;
+
+import org.egothor.stemmer.*;
+
+public final class EnglishExample {
+
+    public static void main(String[] args) throws IOException {
+        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                true,
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+        );
+
+        String word = "running";
+        String patch = trie.get(word); // patch command stored for this word
+        String stem = PatchCommandEncoder.apply(word, patch); // apply the patch to obtain the stem
+
+        System.out.println(word + " -> " + stem);
+    }
+}
+```
+
+
+
+## `US_UK` vs `US_UK_PROFI`
+
+### `US_UK`
+
+* smaller dictionary
+* faster load time
+* suitable for lightweight use cases
+
+### `US_UK_PROFI`
+
+* larger and more complete dataset
+* better coverage of word forms
+* improved stemming quality
+* slightly larger memory footprint
+
+### Recommendation
+
+Use:
+
+```
+US_UK_PROFI
+```
+
+for most applications unless memory constraints are strict.
+
+
+
+## How bundled dictionaries are loaded
+
+Internally:
+
+- dictionaries are stored as text resources
+- parsed using `StemmerDictionaryParser`
+- compiled into a trie at load time
+
+This means:
+
+- the first load pays the parsing and compilation cost
+- lookups on the loaded trie are fast afterwards
+
+
+
+## When to use bundled languages
+
+Bundled dictionaries are suitable when:
+
+- you need quick results without preparing custom data
+- you are prototyping or experimenting
+- your language requirements match the provided datasets
+
+
+
+## When to use custom dictionaries
+
+You should prefer custom dictionaries when:
+
+- domain-specific vocabulary is important
+- accuracy requirements are high
+- you need full control over stemming behavior
+
+Typical examples:
+
+- technical terminology
+- product catalogs
+- biomedical text
+- legal or financial language
+
+
+
+## Production recommendation
+
+For production systems:
+
+1. Load a bundled dictionary
+2. Extend it with domain-specific terms (optional)
+3. Compile it into a binary `.radixor.gz` file
+4. Deploy the compiled artifact
+5. Load it using `loadBinary(...)`
+
+This avoids:
+
+- runtime parsing overhead
+- repeated compilation
+- startup latency
+
+
+
+## Example workflow
+
+```java
+// 1. Load bundled dictionary
+FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
+        StemmerPatchTrieLoader.Language.US_UK_PROFI,
+        true,
+        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+);
+
+// 2. Modify (optional)
+FrequencyTrie.Builder<String> builder =
+        FrequencyTrieBuilders.copyOf(
+                base,
+                String[]::new,
+                ReductionSettings.withDefaults(
+                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
+                )
+        );
+
+builder.put("microservices", PatchCommandEncoder.NOOP_PATCH); // NOOP patch: the word maps to itself
+
+// 3. Compile
+FrequencyTrie<String> compiled = builder.build();
+
+// 4. Save
+StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
+```
+
+
+
+## Limitations
+
+* bundled dictionaries are **general-purpose**
+* they may not reflect:
+
+  * domain-specific usage
+  * rare or specialized vocabulary
+  * organization-specific terminology
+
+
+
+## Next steps
+
+* [Quick start](quick-start.md)
+* [Dictionary format](dictionary-format.md)
+* [CLI compilation](cli-compilation.md)
+* [Programmatic usage](programmatic-usage.md)
+
+
+
+## Summary
+
+Radixor’s built-in language support provides:
+
+* immediate usability
+* reference datasets
+* a starting point for customization
+
+For production systems, the bundled dictionaries are best used as:
+
+* a baseline
+* a seed for further extension
+* a source for compiled deployment artifacts
+
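
The loading steps listed under "How bundled dictionaries are loaded" (text resource, parsing, compilation into a lookup structure at load time) can be illustrated with a small, library-free sketch. Everything in it is an assumption made for illustration: the class `LoadTimeCompilationSketch`, the method `parseAndCompile`, the stem-first line layout, and the plain `HashMap` that stands in for the trie. It does not use `StemmerDictionaryParser` or any other Radixor type, and the real dictionary format is the one described in `dictionary-format.md`.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative stand-in only; not part of Radixor. */
final class LoadTimeCompilationSketch {

    /**
     * Parses dictionary text and "compiles" it into a lookup table.
     * Assumed line layout (hypothetical): the stem first, then its word
     * forms, separated by whitespace. Blank lines are ignored.
     */
    static Map<String, String> parseAndCompile(String dictionaryText) {
        Map<String, String> formToStem = new HashMap<>();
        for (String line : dictionaryText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            String stem = tokens[0];
            for (String token : tokens) {
                // The first mapping wins; the stem itself maps to itself.
                formToStem.putIfAbsent(token, stem);
            }
        }
        return formToStem;
    }

    public static void main(String[] args) {
        String dictionary = "run running runs ran\nhouse houses\n";

        // Paid once, at load time: parsing plus table construction.
        Map<String, String> table = parseAndCompile(dictionary);

        // Every later lookup reuses the structure that was already built.
        System.out.println("running -> " + table.get("running"));
        System.out.println("houses -> " + table.get("houses"));
    }
}
```

The sketch only shows the cost split: parsing and table construction happen once per load, which is the work a precompiled binary artifact avoids at startup.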