docs: improve README, MkDocs content, branding assets, and site polish

2026-04-19 00:18:42 +02:00
parent db79dd2d4f
commit 0b674a39a8
19 changed files with 1836 additions and 1698 deletions


@@ -1,15 +1,8 @@
# Built-in Languages
Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
These built-in resources are useful for:
- quick integration
- testing and evaluation
- reference behavior
- prototyping search pipelines
Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom lexical data first.
These resources are intended as practical default dictionaries for common use. They provide a solid starting point for evaluation, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced by a custom dictionary.
## Overview
@@ -19,34 +12,30 @@ Bundled dictionaries are exposed through:
StemmerPatchTrieLoader.Language
```
They are packaged with the library and loaded from the classpath.
They are packaged with the library as text resources and compiled into a `FrequencyTrie<String>` when loaded.
## Supported languages
The following language identifiers are currently available:
| Language | Enum constant | Description |
|----------|------------------|------------------------------|
| Danish | `DA_DK` | Danish |
| German | `DE_DE` | German |
| Spanish | `ES_ES` | Spanish |
| French | `FR_FR` | French |
| Italian | `IT_IT` | Italian |
| Dutch | `NL_NL` | Dutch |
| Norwegian| `NO_NO` | Norwegian |
| Portuguese| `PT_PT` | Portuguese |
| Russian | `RU_RU` | Russian |
| Swedish | `SV_SE` | Swedish |
| English | `US_UK` | Standard English |
| English | `US_UK_PROFI` | Extended English dictionary |
The following bundled language identifiers are currently available:
| Language | Enum constant | Notes |
|---|---|---|
| Danish | `DA_DK` | Bundled general-purpose dictionary |
| German | `DE_DE` | Bundled general-purpose dictionary |
| Spanish | `ES_ES` | Bundled general-purpose dictionary |
| French | `FR_FR` | Bundled general-purpose dictionary |
| Italian | `IT_IT` | Bundled general-purpose dictionary |
| Dutch | `NL_NL` | Bundled general-purpose dictionary |
| Norwegian | `NO_NO` | Bundled general-purpose dictionary |
| Portuguese | `PT_PT` | Bundled general-purpose dictionary |
| Russian | `RU_RU` | Currently supplied in normalized transliterated form |
| Swedish | `SV_SE` | Bundled general-purpose dictionary |
| English | `US_UK` | Standard English dictionary |
| English | `US_UK_PROFI` | Extended English dictionary |
## Basic usage
Load a bundled stemmer:
Load a bundled stemmer like this:
```java
import java.io.IOException;
@@ -57,194 +46,177 @@ import org.egothor.stemmer.StemmerPatchTrieLoader;
public final class BuiltInExample {
public static void main(String[] args) throws IOException {
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
private BuiltInExample() {
throw new AssertionError("No instances.");
}
public static void main(final String[] arguments) throws IOException {
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
}
}
```
The loader reads the bundled dictionary resource, parses the textual entries, derives patch-command mappings, and compiles the result into a read-only trie.
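As a rough illustration of the parse, derive, and compile steps described above, the following self-contained sketch uses only the Java standard library. It is not Radixor's implementation: the line layout (stem first, then its variants), the `SuffixPatch` type, and the `derive` and `compile` helpers are all assumptions made for this example, and an unmodifiable map stands in for the read-only trie.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PatchPipelineSketch {

    /** Hypothetical patch: drop {@code strip} trailing characters, then append {@code suffix}. */
    static final class SuffixPatch {
        final int strip;
        final String suffix;

        SuffixPatch(final int strip, final String suffix) {
            this.strip = strip;
            this.suffix = suffix;
        }

        String apply(final String word) {
            return word.substring(0, word.length() - strip) + suffix;
        }
    }

    private PatchPipelineSketch() {
        throw new AssertionError("No instances.");
    }

    /** Derives the patch that rewrites {@code variant} into {@code stem}. */
    static SuffixPatch derive(final String variant, final String stem) {
        final int limit = Math.min(variant.length(), stem.length());
        int common = 0;
        while (common < limit && variant.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return new SuffixPatch(variant.length() - common, stem.substring(common));
    }

    /** Parses "stem variant variant ..." lines (assumed layout) into a read-only lookup. */
    static Map<String, SuffixPatch> compile(final List<String> lines) {
        final Map<String, SuffixPatch> patches = new LinkedHashMap<>();
        for (final String line : lines) {
            final String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            final String[] tokens = trimmed.split("\\s+");
            final String stem = tokens[0];
            for (final String token : tokens) {
                // The stem maps to itself; a later line overwrites an earlier mapping.
                patches.put(token, derive(token, stem));
            }
        }
        return Collections.unmodifiableMap(patches);
    }

    public static void main(final String[] arguments) {
        final Map<String, SuffixPatch> compiled =
                compile(List.of("run running runs", "study studies studied"));
        final String word = "studies";
        System.out.println(word + " -> " + compiled.get(word).apply(word));
    }
}
```

The sketch keeps only the last patch seen for each word, whereas the frequency-ranked reduction modes named above suggest that the real trie can retain several ranked candidates per word.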
## Example: stemming with `US_UK_PROFI`
```java
import java.io.IOException;
import org.egothor.stemmer.*;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.PatchCommandEncoder;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;
public final class EnglishExample {
public static void main(String[] args) throws IOException {
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
private EnglishExample() {
throw new AssertionError("No instances.");
}
public static void main(final String[] arguments) throws IOException {
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
String word = "running";
String patch = trie.get(word);
String stem = PatchCommandEncoder.apply(word, patch);
final String word = "running";
final String patch = trie.get(word);
final String stem = PatchCommandEncoder.apply(word, patch);
System.out.println(word + " -> " + stem);
}
}
```
## `US_UK` and `US_UK_PROFI`
## `US_UK` vs `US_UK_PROFI`
Radixor currently provides two bundled English variants.
### `US_UK`
* smaller dictionary
* faster load time
* suitable for lightweight use cases
`US_UK` is the lighter-weight bundled English resource. It is suitable where a smaller default dictionary is preferred and maximal lexical coverage is not the primary goal.
### `US_UK_PROFI`
* larger and more complete dataset
* better coverage of word forms
* improved stemming quality
* slightly larger memory footprint
`US_UK_PROFI` is the more extensive bundled English resource. It offers broader lexical coverage and is the better default for most applications that want stronger out-of-the-box behavior.
### Recommendation
Use:
For most English-language deployments, prefer:
```
```text
US_UK_PROFI
```
for most applications unless memory constraints are strict.
Use `US_UK` when a smaller bundled baseline is more appropriate.
## Intended role of bundled dictionaries
Bundled dictionaries should be understood as **general-purpose default resources**.
## How bundled dictionaries are loaded
They are a good fit when:
Internally:
- a supported language is already available,
- immediate usability matters,
- a reasonable baseline is sufficient,
- the goal is evaluation, prototyping, or straightforward integration.
- dictionaries are stored as text resources
- parsed using `StemmerDictionaryParser`
- compiled into a trie at load time
They are also well suited to staged refinement workflows in which the bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.
This means:
## Character representation
- first load includes parsing + compilation cost
- subsequent usage is fast
The current bundled resources follow a pragmatic normalization convention.
At present, bundled dictionaries are supplied in normalized plain-ASCII form. For some languages, this is simply a lightweight maintenance convention. For others, especially languages commonly written in another script, it reflects a transliterated lexical resource. Russian is the clearest example in the current bundled set.
This convention belongs to the supplied dictionary resources, not to the core stemming model. The parser reads UTF-8 text, the dictionary model works with ordinary Java strings, and the trie and patch-command mechanism operate on general character sequences. In practical terms, the architecture is compatible with native-script dictionaries when suitable lexical resources are available.
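The point about script independence can be checked at the plain Java level. The sketch below is only an illustration under stated assumptions: `parseEntries` and `commonPrefixLength` are hypothetical helpers, not part of Radixor, and the whitespace-separated line layout is assumed. It decodes UTF-8 bytes into ordinary strings and compares Cyrillic word forms character by character, which is the kind of operation a suffix-oriented patch mechanism relies on.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class NativeScriptSketch {

    private NativeScriptSketch() {
        throw new AssertionError("No instances.");
    }

    /** Decodes UTF-8 bytes and splits each non-blank line into whitespace-separated tokens. */
    static List<String[]> parseEntries(final byte[] utf8Bytes) {
        final String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        final List<String[]> entries = new ArrayList<>();
        for (final String line : text.split("\\R")) {
            final String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed.split("\\s+"));
            }
        }
        return entries;
    }

    /** Counts the leading characters shared by both strings. */
    static int commonPrefixLength(final String first, final String second) {
        final int limit = Math.min(first.length(), second.length());
        int common = 0;
        while (common < limit && first.charAt(common) == second.charAt(common)) {
            common++;
        }
        return common;
    }

    public static void main(final String[] arguments) {
        // Transliterated: "kniga knigi knigu", written with Unicode escapes so the
        // example does not depend on the encoding of this source file.
        final String line = "\u043a\u043d\u0438\u0433\u0430 "
                + "\u043a\u043d\u0438\u0433\u0438 "
                + "\u043a\u043d\u0438\u0433\u0443\n";
        final String[] entry = parseEntries(line.getBytes(StandardCharsets.UTF_8)).get(0);
        for (int index = 1; index < entry.length; index++) {
            System.out.println(entry[index] + " shares "
                    + commonPrefixLength(entry[index], entry[0])
                    + " leading characters with " + entry[0]);
        }
    }
}
```

Nothing in this sketch is specific to ASCII, which is consistent with the statement that the normalization convention belongs to the bundled resources rather than to the stemming model.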
## When to use bundled languages
## When to prefer custom dictionaries
Bundled dictionaries are suitable when:
A custom dictionary is usually the better choice when:
- you need quick results without preparing custom data
- you are prototyping or experimenting
- your language requirements match the provided datasets
## When to use custom dictionaries
You should prefer custom dictionaries when:
- domain-specific vocabulary is important
- accuracy requirements are high
- you need full control over stemming behavior
Typical examples:
- technical terminology
- product catalogs
- biomedical text
- legal or financial language
- domain-specific vocabulary materially affects stemming quality,
- lexical coverage must be controlled more precisely,
- a stronger language resource is available than the bundled baseline,
- native-script support is needed beyond the currently bundled resources.
Typical examples include:
- technical terminology,
- biomedical language,
- legal or financial vocabulary,
- organization-specific product and process names,
- language resources maintained in native scripts.
## Production recommendation
For production systems:
For production systems, the most robust workflow is usually:
1. Load a bundled dictionary
2. Extend it with domain-specific terms (optional)
3. Compile it into a binary `.radixor.gz` file
4. Deploy the compiled artifact
5. Load it using `loadBinary(...)`
1. start from a bundled dictionary when it is suitable,
2. extend it with domain-specific forms if needed,
3. compile or rebuild it into a binary `.radixor.gz` artifact,
4. deploy that compiled artifact,
5. load it at runtime using `loadBinary(...)`.
This avoids:
This avoids repeated startup parsing and makes the deployed stemming behavior explicit and versionable.
- runtime parsing overhead
- repeated compilation
- startup latency
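The idea behind steps 3 to 5 is to pay the parsing and compilation cost once, at build time, and to read a compact compressed artifact at startup. The sketch below illustrates that pattern with the Java standard library only. It does not reproduce Radixor's binary format or API: `CompiledArtifactSketch`, its methods, and the simple word-to-stem map are assumptions made for illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CompiledArtifactSketch {

    private CompiledArtifactSketch() {
        throw new AssertionError("No instances.");
    }

    /** Build-time step: writes the prepared mapping as a gzip-compressed binary file. */
    static void write(final Map<String, String> stems, final Path target) {
        try (DataOutputStream out = new DataOutputStream(
                new GZIPOutputStream(Files.newOutputStream(target)))) {
            out.writeInt(stems.size());
            for (final Map.Entry<String, String> entry : stems.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
    }

    /** Runtime step: reads the artifact back without parsing any dictionary text. */
    static Map<String, String> read(final Path source) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(Files.newInputStream(source)))) {
            final int size = in.readInt();
            final Map<String, String> stems = new LinkedHashMap<>();
            for (int index = 0; index < size; index++) {
                final String word = in.readUTF();
                final String stem = in.readUTF();
                stems.put(word, stem);
            }
            return stems;
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
    }

    /** Writes to a temporary file, reads it back, and removes the file again. */
    static Map<String, String> roundTrip(final Map<String, String> stems) {
        try {
            final Path file = Files.createTempFile("stems-sketch", ".gz");
            try {
                write(stems, file);
                return read(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException exception) {
            throw new UncheckedIOException(exception);
        }
    }

    public static void main(final String[] arguments) {
        final Map<String, String> stems = new LinkedHashMap<>();
        stems.put("running", "run");
        stems.put("studies", "study");
        System.out.println(roundTrip(stems).get("running"));
    }
}
```

In an actual deployment the write step would run in the build pipeline and only the read step would run in the service, so the artifact can be versioned alongside the application.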
## Example workflow
## Example refinement workflow
```java
// 1. Load bundled dictionary
FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
import java.io.IOException;
import java.nio.file.Path;
// 2. Modify (optional)
FrequencyTrie.Builder<String> builder =
FrequencyTrieBuilders.copyOf(
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.FrequencyTrieBuilders;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.ReductionSettings;
import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
import org.egothor.stemmer.StemmerPatchTrieLoader;
public final class BundledRefinementExample {
private BundledRefinementExample() {
throw new AssertionError("No instances.");
}
public static void main(final String[] arguments) throws IOException {
final FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
base,
String[]::new,
ReductionSettings.withDefaults(
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
)
);
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
builder.put("microservices", "Na");
// 3. Compile
FrequencyTrie<String> compiled = builder.build();
final FrequencyTrie<String> compiled = builder.build();
// 4. Save
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
}
}
```
## Extending language support
The built-in set is intentionally a practical baseline rather than a closed catalog. High-quality dictionaries for additional languages, improved language coverage, and stronger native-script resources are all natural extension paths for the project.
## Limitations
* bundled dictionaries are **general-purpose**
* they may not reflect:
* domain-specific usage
* rare or specialized vocabulary
* organization-specific terminology
What matters is not only the number of entries but also the quality, consistency, and operational usefulness of the lexical resource being added.
## Next steps
* [Quick start](quick-start.md)
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Programmatic usage](programmatic-usage.md)
- [Quick start](quick-start.md)
- [Dictionary format](dictionary-format.md)
- [CLI compilation](cli-compilation.md)
- [Programmatic usage](programmatic-usage.md)
## Summary
Radixor's built-in language support provides:
* immediate usability
* reference datasets
* a starting point for customization
For production systems, they are best used as:
* a baseline
* a seed for further extension
* a source for compiled deployment artifacts
Radixor's built-in language support provides immediate usability, practical default dictionaries, and a strong starting point for custom refinement. The current bundled resources follow a pragmatic normalization convention, while the underlying architecture remains well suited to richer language resources and future extensions.