docs: improve README, MkDocs content, branding assets, and site polish
This commit is contained in:
@@ -1,15 +1,8 @@
|
||||
# Built-in Languages
|
||||
|
||||
Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.
|
||||
|
||||
These built-in resources are useful for:
|
||||
|
||||
- quick integration
|
||||
- testing and evaluation
|
||||
- reference behavior
|
||||
- prototyping search pipelines
|
||||
|
||||
Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom lexical data first.
|
||||
|
||||
These resources are intended as practical default dictionaries for common use. They provide a solid starting point for evaluation, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced by a custom dictionary.
|
||||
|
||||
## Overview
|
||||
|
||||
@@ -19,34 +12,30 @@ Bundled dictionaries are exposed through:
|
||||
StemmerPatchTrieLoader.Language
|
||||
```
|
||||
|
||||
They are packaged with the library and loaded from the classpath.
|
||||
|
||||
|
||||
They are packaged with the library as text resources and compiled into a `FrequencyTrie<String>` when loaded.
|
||||
|
||||
## Supported languages
|
||||
|
||||
The following language identifiers are currently available:
|
||||
|
||||
| Language | Enum constant | Description |
|
||||
|----------|------------------|------------------------------|
|
||||
| Danish | `DA_DK` | Danish |
|
||||
| German | `DE_DE` | German |
|
||||
| Spanish | `ES_ES` | Spanish |
|
||||
| French | `FR_FR` | French |
|
||||
| Italian | `IT_IT` | Italian |
|
||||
| Dutch | `NL_NL` | Dutch |
|
||||
| Norwegian| `NO_NO` | Norwegian |
|
||||
| Portuguese| `PT_PT` | Portuguese |
|
||||
| Russian | `RU_RU` | Russian |
|
||||
| Swedish | `SV_SE` | Swedish |
|
||||
| English | `US_UK` | Standard English |
|
||||
| English | `US_UK_PROFI` | Extended English dictionary |
|
||||
|
||||
The following bundled language identifiers are currently available:
|
||||
|
||||
| Language | Enum constant | Notes |
|
||||
|---|---|---|
|
||||
| Danish | `DA_DK` | Bundled general-purpose dictionary |
|
||||
| German | `DE_DE` | Bundled general-purpose dictionary |
|
||||
| Spanish | `ES_ES` | Bundled general-purpose dictionary |
|
||||
| French | `FR_FR` | Bundled general-purpose dictionary |
|
||||
| Italian | `IT_IT` | Bundled general-purpose dictionary |
|
||||
| Dutch | `NL_NL` | Bundled general-purpose dictionary |
|
||||
| Norwegian | `NO_NO` | Bundled general-purpose dictionary |
|
||||
| Portuguese | `PT_PT` | Bundled general-purpose dictionary |
|
||||
| Russian | `RU_RU` | Currently supplied in normalized transliterated form |
|
||||
| Swedish | `SV_SE` | Bundled general-purpose dictionary |
|
||||
| English | `US_UK` | Standard English dictionary |
|
||||
| English | `US_UK_PROFI` | Extended English dictionary |
|
||||
|
||||
## Basic usage
|
||||
|
||||
Load a bundled stemmer:
|
||||
Load a bundled stemmer like this:
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
@@ -57,194 +46,177 @@ import org.egothor.stemmer.StemmerPatchTrieLoader;
|
||||
|
||||
public final class BuiltInExample {
|
||||
|
||||
public static void main(String[] args) throws IOException {
|
||||
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
private BuiltInExample() {
|
||||
throw new AssertionError("No instances.");
|
||||
}
|
||||
|
||||
public static void main(final String[] arguments) throws IOException {
|
||||
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||||
true,
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
);
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
The loader reads the bundled dictionary resource, parses the textual entries, derives patch-command mappings, and compiles the result into a read-only trie.
|
||||
|
||||
## Example: stemming with `US_UK_PROFI`
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
|
||||
import org.egothor.stemmer.*;
|
||||
import org.egothor.stemmer.FrequencyTrie;
|
||||
import org.egothor.stemmer.PatchCommandEncoder;
|
||||
import org.egothor.stemmer.ReductionMode;
|
||||
import org.egothor.stemmer.StemmerPatchTrieLoader;
|
||||
|
||||
public final class EnglishExample {
|
||||
|
||||
public static void main(String[] args) throws IOException {
|
||||
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
private EnglishExample() {
|
||||
throw new AssertionError("No instances.");
|
||||
}
|
||||
|
||||
public static void main(final String[] arguments) throws IOException {
|
||||
final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
|
||||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||||
true,
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
);
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
|
||||
|
||||
String word = "running";
|
||||
String patch = trie.get(word);
|
||||
String stem = PatchCommandEncoder.apply(word, patch);
|
||||
final String word = "running";
|
||||
final String patch = trie.get(word);
|
||||
final String stem = PatchCommandEncoder.apply(word, patch);
|
||||
|
||||
System.out.println(word + " -> " + stem);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## `US_UK` and `US_UK_PROFI`
|
||||
|
||||
|
||||
## `US_UK` vs `US_UK_PROFI`
|
||||
Radixor currently provides two bundled English variants.
|
||||
|
||||
### `US_UK`
|
||||
|
||||
* smaller dictionary
|
||||
* faster load time
|
||||
* suitable for lightweight use cases
|
||||
`US_UK` is the lighter-weight bundled English resource. It is suitable where a smaller default dictionary is preferred and maximal lexical coverage is not the primary goal.
|
||||
|
||||
### `US_UK_PROFI`
|
||||
|
||||
* larger and more complete dataset
|
||||
* better coverage of word forms
|
||||
* improved stemming quality
|
||||
* slightly larger memory footprint
|
||||
`US_UK_PROFI` is the more extensive bundled English resource. It offers broader lexical coverage and is the better default for most applications that want stronger out-of-the-box behavior.
|
||||
|
||||
### Recommendation
|
||||
|
||||
Use:
|
||||
For most English-language deployments, prefer:
|
||||
|
||||
```
|
||||
```text
|
||||
US_UK_PROFI
|
||||
```
|
||||
|
||||
for most applications unless memory constraints are strict.
|
||||
Use `US_UK` when a smaller bundled baseline is more appropriate.
|
||||
|
||||
## Intended role of bundled dictionaries
|
||||
|
||||
Bundled dictionaries should be understood as **general-purpose default resources**.
|
||||
|
||||
## How bundled dictionaries are loaded
|
||||
They are a good fit when:
|
||||
|
||||
Internally:
|
||||
- a supported language is already available,
|
||||
- immediate usability matters,
|
||||
- a reasonable baseline is sufficient,
|
||||
- the goal is evaluation, prototyping, or straightforward integration.
|
||||
|
||||
- dictionaries are stored as text resources
|
||||
- parsed using `StemmerDictionaryParser`
|
||||
- compiled into a trie at load time
|
||||
They are also well suited to staged refinement workflows in which the bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.
|
||||
|
||||
This means:
|
||||
## Character representation
|
||||
|
||||
- first load includes parsing + compilation cost
|
||||
- subsequent usage is fast
|
||||
The current bundled resources follow a pragmatic normalization convention.
|
||||
|
||||
At present, bundled dictionaries are supplied in normalized plain-ASCII form. For some languages, this is simply a lightweight maintenance convention. For others, especially languages commonly written in another script, it reflects a transliterated lexical resource. Russian is the clearest example in the current bundled set.
|
||||
|
||||
This convention belongs to the supplied dictionary resources, not to the core stemming model. The parser reads UTF-8 text, the dictionary model works with ordinary Java strings, and the trie and patch-command mechanism operate on general character sequences. In practical terms, the architecture is compatible with native-script dictionaries when suitable lexical resources are available.
|
||||
|
||||
## When to use bundled languages
|
||||
## When to prefer custom dictionaries
|
||||
|
||||
Bundled dictionaries are suitable when:
|
||||
A custom dictionary is usually the better choice when:
|
||||
|
||||
- you need quick results without preparing custom data
|
||||
- you are prototyping or experimenting
|
||||
- your language requirements match the provided datasets
|
||||
|
||||
|
||||
|
||||
## When to use custom dictionaries
|
||||
|
||||
You should prefer custom dictionaries when:
|
||||
|
||||
- domain-specific vocabulary is important
|
||||
- accuracy requirements are high
|
||||
- you need full control over stemming behavior
|
||||
|
||||
Typical examples:
|
||||
|
||||
- technical terminology
|
||||
- product catalogs
|
||||
- biomedical text
|
||||
- legal or financial language
|
||||
- domain-specific vocabulary materially affects stemming quality,
|
||||
- lexical coverage must be controlled more precisely,
|
||||
- a stronger language resource is available than the bundled baseline,
|
||||
- native-script support is needed beyond the currently bundled resources.
|
||||
|
||||
Typical examples include:
|
||||
|
||||
- technical terminology,
|
||||
- biomedical language,
|
||||
- legal or financial vocabulary,
|
||||
- organization-specific product and process names,
|
||||
- language resources maintained in native scripts.
|
||||
|
||||
## Production recommendation
|
||||
|
||||
For production systems:
|
||||
For production systems, the most robust workflow is usually:
|
||||
|
||||
1. Load a bundled dictionary
|
||||
2. Extend it with domain-specific terms (optional)
|
||||
3. Compile it into a binary `.radixor.gz` file
|
||||
4. Deploy the compiled artifact
|
||||
5. Load it using `loadBinary(...)`
|
||||
1. start from a bundled dictionary when it is suitable,
|
||||
2. extend it with domain-specific forms if needed,
|
||||
3. compile or rebuild it into a binary `.radixor.gz` artifact,
|
||||
4. deploy that compiled artifact,
|
||||
5. load it at runtime using `loadBinary(...)`.
|
||||
|
||||
This avoids:
|
||||
This avoids repeated startup parsing and makes the deployed stemming behavior explicit and versionable.
|
||||
|
||||
- runtime parsing overhead
|
||||
- repeated compilation
|
||||
- startup latency
|
||||
|
||||
|
||||
|
||||
## Example workflow
|
||||
## Example refinement workflow
|
||||
|
||||
```java
|
||||
// 1. Load bundled dictionary
|
||||
FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
|
||||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||||
true,
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
);
|
||||
import java.io.IOException;
|
||||
import java.nio.file.Path;
|
||||
|
||||
// 2. Modify (optional)
|
||||
FrequencyTrie.Builder<String> builder =
|
||||
FrequencyTrieBuilders.copyOf(
|
||||
import org.egothor.stemmer.FrequencyTrie;
|
||||
import org.egothor.stemmer.FrequencyTrieBuilders;
|
||||
import org.egothor.stemmer.ReductionMode;
|
||||
import org.egothor.stemmer.ReductionSettings;
|
||||
import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
|
||||
import org.egothor.stemmer.StemmerPatchTrieLoader;
|
||||
|
||||
public final class BundledRefinementExample {
|
||||
|
||||
private BundledRefinementExample() {
|
||||
throw new AssertionError("No instances.");
|
||||
}
|
||||
|
||||
public static void main(final String[] arguments) throws IOException {
|
||||
final FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
|
||||
StemmerPatchTrieLoader.Language.US_UK_PROFI,
|
||||
true,
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
|
||||
|
||||
final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
|
||||
base,
|
||||
String[]::new,
|
||||
ReductionSettings.withDefaults(
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
|
||||
)
|
||||
);
|
||||
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));
|
||||
|
||||
builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);
|
||||
builder.put("microservices", "Na");
|
||||
|
||||
// 3. Compile
|
||||
FrequencyTrie<String> compiled = builder.build();
|
||||
final FrequencyTrie<String> compiled = builder.build();
|
||||
|
||||
// 4. Save
|
||||
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
|
||||
StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Extending language support
|
||||
|
||||
The built-in set is intentionally a practical baseline rather than a closed catalog. High-quality dictionaries for additional languages, improved language coverage, and stronger native-script resources are all natural extension paths for the project.
|
||||
|
||||
## Limitations
|
||||
|
||||
* bundled dictionaries are **general-purpose**
|
||||
* they may not reflect:
|
||||
|
||||
* domain-specific usage
|
||||
* rare or specialized vocabulary
|
||||
* organization-specific terminology
|
||||
|
||||
|
||||
What matters most is not only the number of entries, but the quality, consistency, and operational usefulness of the lexical resource being added.
|
||||
|
||||
## Next steps
|
||||
|
||||
* [Quick start](quick-start.md)
|
||||
* [Dictionary format](dictionary-format.md)
|
||||
* [CLI compilation](cli-compilation.md)
|
||||
* [Programmatic usage](programmatic-usage.md)
|
||||
|
||||
|
||||
- [Quick start](quick-start.md)
|
||||
- [Dictionary format](dictionary-format.md)
|
||||
- [CLI compilation](cli-compilation.md)
|
||||
- [Programmatic usage](programmatic-usage.md)
|
||||
|
||||
## Summary
|
||||
|
||||
Radixor’s built-in language support provides:
|
||||
|
||||
* immediate usability
|
||||
* reference datasets
|
||||
* a starting point for customization
|
||||
|
||||
For production systems, they are best used as:
|
||||
|
||||
* a baseline
|
||||
* a seed for further extension
|
||||
* a source for compiled deployment artifacts
|
||||
|
||||
Radixor’s built-in language support provides immediate usability, practical default dictionaries, and a strong starting point for custom refinement. The current bundled resources follow a pragmatic normalization convention, while the underlying architecture remains well suited to richer language resources and future extensions.
|
||||
|
||||
Reference in New Issue
Block a user