docs: improve README, MkDocs content, branding assets, and site polish

2026-04-19 00:18:42 +02:00
parent db79dd2d4f
commit 0b674a39a8
19 changed files with 1836 additions and 1698 deletions
--- a/docs/built-in-languages.md
+++ b/docs/built-in-languages.md
@@ -1,15 +1,8 @@
 # Built-in Languages

-Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
-
-These built-in resources are useful for:
-
-- quick integration
-- testing and evaluation
-- reference behavior
-- prototyping search pipelines
-
+Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom lexical data first.

+These resources are intended as practical default dictionaries for common use. They provide a solid starting point for evaluation, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced by a custom dictionary.

 ## Overview

@@ -19,34 +12,30 @@ Bundled dictionaries are exposed through:
 StemmerPatchTrieLoader.Language
 ```

-They are packaged with the library and loaded from the classpath.
-
-
+They are packaged with the library as text resources and compiled into a `FrequencyTrie<String>` when loaded.

 ## Supported languages

-The following language identifiers are currently available:
-
-| Language | Enum constant     | Description                  |
-|----------|------------------|------------------------------|
-| Danish   | `DA_DK`          | Danish                       |
-| German   | `DE_DE`          | German                       |
-| Spanish  | `ES_ES`          | Spanish                      |
-| French   | `FR_FR`          | French                       |
-| Italian  | `IT_IT`          | Italian                      |
-| Dutch    | `NL_NL`          | Dutch                        |
-| Norwegian| `NO_NO`          | Norwegian                    |
-| Portuguese| `PT_PT`         | Portuguese                   |
-| Russian  | `RU_RU`          | Russian                      |
-| Swedish  | `SV_SE`          | Swedish                      |
-| English  | `US_UK`          | Standard English             |
-| English  | `US_UK_PROFI`    | Extended English dictionary  |
-
+The following bundled language identifiers are currently available:

+| Language | Enum constant | Notes |
+|---|---|---|
+| Danish | `DA_DK` | Bundled general-purpose dictionary |
+| German | `DE_DE` | Bundled general-purpose dictionary |
+| Spanish | `ES_ES` | Bundled general-purpose dictionary |
+| French | `FR_FR` | Bundled general-purpose dictionary |
+| Italian | `IT_IT` | Bundled general-purpose dictionary |
+| Dutch | `NL_NL` | Bundled general-purpose dictionary |
+| Norwegian | `NO_NO` | Bundled general-purpose dictionary |
+| Portuguese | `PT_PT` | Bundled general-purpose dictionary |
+| Russian | `RU_RU` | Currently supplied in normalized transliterated form |
+| Swedish | `SV_SE` | Bundled general-purpose dictionary |
+| English | `US_UK` | Standard English dictionary |
+| English | `US_UK_PROFI` | Extended English dictionary |

 ## Basic usage

-Load a bundled stemmer:
+Load a bundled stemmer like this:

 ```java
 import java.io.IOException;
@@ -57,194 +46,177 @@ import org.egothor.stemmer.StemmerPatchTrieLoader;

 public final class BuiltInExample {

-    public static void main(String[] args) throws IOException {
-        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+    private BuiltInExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
-                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
-        );
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
    }
 }
 ```

-
+The loader reads the bundled dictionary resource, parses the textual entries, derives patch-command mappings, and compiles the result into a read-only trie.

 ## Example: stemming with `US_UK_PROFI`

 ```java
 import java.io.IOException;

-import org.egothor.stemmer.*;
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.PatchCommandEncoder;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.StemmerPatchTrieLoader;

 public final class EnglishExample {

-    public static void main(String[] args) throws IOException {
-        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
+    private EnglishExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
-                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
-        );
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

-        String word = "running";
-        String patch = trie.get(word);
-        String stem = PatchCommandEncoder.apply(word, patch);
+        final String word = "running";
+        final String patch = trie.get(word);
+        final String stem = PatchCommandEncoder.apply(word, patch);

        System.out.println(word + " -> " + stem);
    }
 }
 ```

+## `US_UK` and `US_UK_PROFI`

-
-## `US_UK` vs `US_UK_PROFI`
+Radixor currently provides two bundled English variants.

 ### `US_UK`

-* smaller dictionary
-* faster load time
-* suitable for lightweight use cases
+`US_UK` is the lighter-weight bundled English resource. It is suitable where a smaller default dictionary is preferred and maximal lexical coverage is not the primary goal.

 ### `US_UK_PROFI`

-* larger and more complete dataset
-* better coverage of word forms
-* improved stemming quality
-* slightly larger memory footprint
+`US_UK_PROFI` is the more extensive bundled English resource. It offers broader lexical coverage and is the better default for most applications that want stronger out-of-the-box behavior.

 ### Recommendation

-Use:
+For most English-language deployments, prefer:

-```
+```text
 US_UK_PROFI
 ```

-for most applications unless memory constraints are strict.
+Use `US_UK` when a smaller bundled baseline is more appropriate.

+## Intended role of bundled dictionaries

+Bundled dictionaries should be understood as **general-purpose default resources**.

-## How bundled dictionaries are loaded
+They are a good fit when:

-Internally:
+- a supported language is already available,
+- immediate usability matters,
+- a reasonable baseline is sufficient,
+- the goal is evaluation, prototyping, or straightforward integration.

-- dictionaries are stored as text resources
-- parsed using `StemmerDictionaryParser`
-- compiled into a trie at load time
+They are also well suited to staged refinement workflows in which the bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.

-This means:
+## Character representation

-- first load includes parsing + compilation cost
-- subsequent usage is fast
+The current bundled resources follow a pragmatic normalization convention.

+At present, bundled dictionaries are supplied in normalized plain-ASCII form. For some languages, this is simply a lightweight maintenance convention. For others, especially languages commonly written in another script, it reflects a transliterated lexical resource. Russian is the clearest example in the current bundled set.

+This convention belongs to the supplied dictionary resources, not to the core stemming model. The parser reads UTF-8 text, the dictionary model works with ordinary Java strings, and the trie and patch-command mechanism operate on general character sequences. In practical terms, the architecture is compatible with native-script dictionaries when suitable lexical resources are available.

-## When to use bundled languages
+## When to prefer custom dictionaries

-Bundled dictionaries are suitable when:
+A custom dictionary is usually the better choice when:

-- you need quick results without preparing custom data
-- you are prototyping or experimenting
-- your language requirements match the provided datasets
-
-
-
-## When to use custom dictionaries
-
-You should prefer custom dictionaries when:
-
-- domain-specific vocabulary is important
-- accuracy requirements are high
-- you need full control over stemming behavior
-
-Typical examples:
-
-- technical terminology
-- product catalogs
-- biomedical text
-- legal or financial language
+- domain-specific vocabulary materially affects stemming quality,
+- lexical coverage must be controlled more precisely,
+- a stronger language resource is available than the bundled baseline,
+- native-script support is needed beyond the currently bundled resources.

+Typical examples include:

+- technical terminology,
+- biomedical language,
+- legal or financial vocabulary,
+- organization-specific product and process names,
+- language resources maintained in native scripts.

 ## Production recommendation

-For production systems:
+For production systems, the most robust workflow is usually:

-1. Load a bundled dictionary
-2. Extend it with domain-specific terms (optional)
-3. Compile it into a binary `.radixor.gz` file
-4. Deploy the compiled artifact
-5. Load it using `loadBinary(...)`
+1. start from a bundled dictionary when it is suitable,
+2. extend it with domain-specific forms if needed,
+3. compile or rebuild it into a binary `.radixor.gz` artifact,
+4. deploy that compiled artifact,
+5. load it at runtime using `loadBinary(...)`.

-This avoids:
+This avoids repeated startup parsing and makes the deployed stemming behavior explicit and versionable.

-- runtime parsing overhead
-- repeated compilation
-- startup latency
-
-
-
-## Example workflow
+## Example refinement workflow

 ```java
-// 1. Load bundled dictionary
-FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
-        StemmerPatchTrieLoader.Language.US_UK_PROFI,
-        true,
-        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
-);
+import java.io.IOException;
+import java.nio.file.Path;

-// 2. Modify (optional)
-FrequencyTrie.Builder<String> builder =
-        FrequencyTrieBuilders.copyOf(
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.FrequencyTrieBuilders;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.ReductionSettings;
+import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class BundledRefinementExample {
+
+    private BundledRefinementExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
+                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                true,
+                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+
+        final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
                base,
                String[]::new,
                ReductionSettings.withDefaults(
-                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
-                )
-        );
+                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));

-builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
+        builder.put("microservices", "Na");

-// 3. Compile
-FrequencyTrie<String> compiled = builder.build();
+        final FrequencyTrie<String> compiled = builder.build();

-// 4. Save
-StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
+        StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
+    }
+}
 ```

+## Extending language support

+The built-in set is intentionally a practical baseline rather than a closed catalog. High-quality dictionaries for additional languages, improved language coverage, and stronger native-script resources are all natural extension paths for the project.

-## Limitations
-
-* bundled dictionaries are **general-purpose**
-* they may not reflect:
-
-  * domain-specific usage
-  * rare or specialized vocabulary
-  * organization-specific terminology
-
-
+What matters most is not only the number of entries, but the quality, consistency, and operational usefulness of the lexical resource being added.

 ## Next steps

-* [Quick start](quick-start.md)
-* [Dictionary format](dictionary-format.md)
-* [CLI compilation](cli-compilation.md)
-* [Programmatic usage](programmatic-usage.md)
-
-
+- [Quick start](quick-start.md)
+- [Dictionary format](dictionary-format.md)
+- [CLI compilation](cli-compilation.md)
+- [Programmatic usage](programmatic-usage.md)

 ## Summary

-Radixor’s built-in language support provides:
-
-* immediate usability
-* reference datasets
-* a starting point for customization
-
-For production systems, they are best used as:
-
-* a baseline
-* a seed for further extension
-* a source for compiled deployment artifacts
-
+Radixor’s built-in language support provides immediate usability, practical default dictionaries, and a strong starting point for custom refinement. The current bundled resources follow a pragmatic normalization convention, while the underlying architecture remains well suited to richer language resources and future extensions.
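
Note on the new "Character representation" section: because the bundled dictionaries are described as normalized plain-ASCII resources, callers may want to fold input words before lookup. The following is only a minimal sketch of that idea, not part of Radixor. The class `AsciiFoldingSketch` and its `fold` method are hypothetical names, and the sketch assumes that simply dropping combining diacritics is an acceptable approximation. The exact normalization used by the bundled resources is not specified in this change, and transliteration (for example for Russian) is not covered here.

```java
import java.text.Normalizer;

/**
 * Hypothetical helper, not part of Radixor: strips combining diacritics
 * so that input resembles a plain-ASCII dictionary convention.
 */
public final class AsciiFoldingSketch {

    private AsciiFoldingSketch() {
        throw new AssertionError("No instances.");
    }

    /**
     * Decomposes the input (NFD) and removes combining marks.
     * Letters without a canonical decomposition (for example 'ø' or 'ß')
     * are left unchanged; handling them is outside this sketch.
     */
    public static String fold(final String input) {
        final String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(final String[] arguments) {
        // Assumption: folding happens before the word is passed to trie.get(...).
        System.out.println(fold("caf\u00e9"));
        System.out.println(fold("\u00dcbung"));
    }
}
```

Whether such folding matches a given bundled dictionary should be checked against the dictionary itself before relying on it.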