# Built-in Languages

Radixor provides a set of **bundled stemmer dictionaries** that can be loaded directly without preparing custom data.

These built-in resources are useful for:

- quick integration
- testing and evaluation
- reference behavior
- prototyping search pipelines

## Overview

Bundled dictionaries are exposed through:

```java
StemmerPatchTrieLoader.Language
```

They are packaged with the library and loaded from the classpath.

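As a rough illustration of what "loaded from the classpath" means in plain Java, the sketch below opens a resource by name through the class loader and reads its lines. It uses only the JDK. The class name `ClasspathResourceSketch`, the resource path `dictionaries/example.stemmer`, and the UTF-8 encoding are assumptions made for this example; this page does not describe Radixor's actual resource layout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class ClasspathResourceSketch {

    /**
     * Reads all lines of a classpath resource.
     * Returns Optional.empty() when no resource with that name is on the classpath.
     */
    static Optional<List<String>> readLines(String resourceName) {
        ClassLoader loader = ClasspathResourceSketch.class.getClassLoader();
        // getResourceAsStream returns null (rather than throwing) for a missing resource.
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                return Optional.empty();
            }
            // UTF-8 is an assumption of this sketch.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            return Optional.of(reader.lines().collect(Collectors.toList()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical resource name, used only for illustration.
        Optional<List<String>> lines = readLines("dictionaries/example.stemmer");
        System.out.println(lines.map(l -> l.size() + " lines").orElse("resource not found"));
    }
}
```

Because the data travels inside the library JAR, no file-system path has to be configured; the trade-off is that the resource is read and processed at run time.
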
## Supported languages

The following language identifiers are currently available:

| Language   | Enum constant | Description                 |
|------------|---------------|-----------------------------|
| Danish     | `DA_DK`       | Danish                      |
| German     | `DE_DE`       | German                      |
| Spanish    | `ES_ES`       | Spanish                     |
| French     | `FR_FR`       | French                      |
| Italian    | `IT_IT`       | Italian                     |
| Dutch      | `NL_NL`       | Dutch                       |
| Norwegian  | `NO_NO`       | Norwegian                   |
| Portuguese | `PT_PT`       | Portuguese                  |
| Russian    | `RU_RU`       | Russian                     |
| Swedish    | `SV_SE`       | Swedish                     |
| English    | `US_UK`       | Standard English            |
| English    | `US_UK_PROFI` | Extended English dictionary |

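Applications often start from a `java.util.Locale` rather than from an enum constant. The following minimal sketch maps a locale's language code to the constant names listed in the table. The helper class `BundledLanguageNames` is hypothetical and not part of Radixor, and mapping English to `US_UK_PROFI` rather than `US_UK` is a choice made for this example only.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class BundledLanguageNames {

    // ISO 639-1 language code -> enum constant name, copied from the table above.
    // Mapping "en" to US_UK_PROFI (rather than US_UK) is a choice made for this sketch.
    private static final Map<String, String> BY_LANGUAGE_CODE = Map.ofEntries(
            Map.entry("da", "DA_DK"),
            Map.entry("de", "DE_DE"),
            Map.entry("es", "ES_ES"),
            Map.entry("fr", "FR_FR"),
            Map.entry("it", "IT_IT"),
            Map.entry("nl", "NL_NL"),
            Map.entry("no", "NO_NO"),
            Map.entry("pt", "PT_PT"),
            Map.entry("ru", "RU_RU"),
            Map.entry("sv", "SV_SE"),
            Map.entry("en", "US_UK_PROFI"));

    /** Returns the constant name for the locale's language, ignoring region and script. */
    static Optional<String> constantNameFor(Locale locale) {
        return Optional.ofNullable(BY_LANGUAGE_CODE.get(locale.getLanguage()));
    }

    public static void main(String[] args) {
        System.out.println(constantNameFor(Locale.forLanguageTag("de-AT"))); // Optional[DE_DE]
        System.out.println(constantNameFor(Locale.JAPANESE));                // Optional.empty
    }
}
```

Since the identifiers are enum constants, a returned name can be turned into the constant with the standard `StemmerPatchTrieLoader.Language.valueOf(name)` call. An empty result signals that no bundled dictionary matches and a custom one would be needed.
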
## Basic usage

Load a bundled stemmer:

```java
import java.io.IOException;

import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BuiltInExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );
    }
}
```

## Example: stemming with `US_UK_PROFI`

```java
import java.io.IOException;

import org.egothor.stemmer.*;

public final class EnglishExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        String word = "running";
        // Look up the patch command stored for the word...
        String patch = trie.get(word);
        // ...and apply it to the word to obtain the stem.
        String stem = PatchCommandEncoder.apply(word, patch);

        System.out.println(word + " -> " + stem);
    }
}
```

## `US_UK` vs `US_UK_PROFI`

### `US_UK`

* smaller dictionary
* faster load time
* suitable for lightweight use cases

### `US_UK_PROFI`

* larger and more complete dataset
* better coverage of word forms
* improved stemming quality
* slightly larger memory footprint

### Recommendation

Use `US_UK_PROFI` for most applications unless memory constraints are strict.

## How bundled dictionaries are loaded

Internally:

- dictionaries are stored as text resources
- they are parsed using `StemmerDictionaryParser`
- they are compiled into a trie at load time

This means:

- the first load includes the parsing and compilation cost
- subsequent use of the loaded trie is fast

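Since the cost sits in the load step, an application that needs the same dictionary in several places can load it once and share the result. The sketch below shows one way to do that with a small generic cache built on `ConcurrentHashMap.computeIfAbsent`. `LoadOnceCache` is a hypothetical helper, not a Radixor class, and the loader in `main` is a stand-in that only counts invocations.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class LoadOnceCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    LoadOnceCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the value for the key, running the loader only the first time the key is seen. */
    V get(K key) {
        // computeIfAbsent invokes the loader at most once per key, even under concurrent access.
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        // Stand-in loader: a real one would parse and compile a dictionary here.
        LoadOnceCache<String, String> tries = new LoadOnceCache<>(name -> {
            loads.incrementAndGet();
            return "trie for " + name;
        });

        tries.get("US_UK_PROFI");
        tries.get("US_UK_PROFI");
        System.out.println("loader ran " + loads.get() + " time(s)"); // loader ran 1 time(s)
    }
}
```

In a Radixor application the key could be a `StemmerPatchTrieLoader.Language` constant and the loader a lambda around `StemmerPatchTrieLoader.load(...)`. Because `Function` cannot throw checked exceptions, such a lambda would have to wrap the `IOException` that the examples on this page declare, for instance in an `UncheckedIOException`.
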
## When to use bundled languages

Bundled dictionaries are suitable when:

- you need quick results without preparing custom data
- you are prototyping or experimenting
- your language requirements match the provided datasets

## When to use custom dictionaries

You should prefer custom dictionaries when:

- domain-specific vocabulary is important
- accuracy requirements are high
- you need full control over stemming behavior

Typical examples:

- technical terminology
- product catalogs
- biomedical text
- legal or financial language

## Production recommendation

For production systems:

1. Load a bundled dictionary
2. Extend it with domain-specific terms (optional)
3. Compile it into a binary `.radixor.gz` file
4. Deploy the compiled artifact
5. Load it using `loadBinary(...)`

This avoids:

- runtime parsing overhead
- repeated compilation
- startup latency

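How much startup time the compiled artifact saves depends on the dictionary and the machine, so it can be worth measuring both load paths before committing to one. The following is a minimal timing sketch using only the JDK. `LoadTimer` is a hypothetical helper and the supplier in `main` is a stand-in for a real load call.

```java
import java.util.function.Supplier;

final class LoadTimer {

    /** A loaded value together with the wall-clock time the load took. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the loader once and records how long it took. */
    static <T> Timed<T> time(Supplier<T> loader) {
        long start = System.nanoTime();
        T value = loader.get();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in supplier: replace with the text-based or binary load under test.
        Timed<String> timed = time(() -> "stand-in for a loaded trie");
        System.out.printf("loaded in %.3f ms%n", timed.elapsedNanos / 1_000_000.0);
    }
}
```

A single measurement like this is only indicative, because class loading and JIT warm-up affect the first call. For startup latency, though, that first call is the number of interest.
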
## Example workflow

```java
import java.io.IOException;
import java.nio.file.Path;

// Assumes all Radixor types used below live in org.egothor.stemmer.
import org.egothor.stemmer.*;

public final class CustomDictionaryWorkflow {

    public static void main(String[] args) throws IOException {
        // 1. Load bundled dictionary
        FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        // 2. Modify (optional)
        FrequencyTrie.Builder<String> builder =
                FrequencyTrieBuilders.copyOf(
                        base,
                        String[]::new,
                        ReductionSettings.withDefaults(
                                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                        )
                );

        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

        // 3. Compile
        FrequencyTrie<String> compiled = builder.build();

        // 4. Save
        StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
    }
}
```

## Limitations

* bundled dictionaries are **general-purpose**
* they may not reflect:

  * domain-specific usage
  * rare or specialized vocabulary
  * organization-specific terminology

## Next steps

* [Quick start](quick-start.md)
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Programmatic usage](programmatic-usage.md)

## Summary

Radixor’s built-in language support provides:

* immediate usability
* reference datasets
* a starting point for customization

For production systems, they are best used as:

* a baseline
* a seed for further extension
* a source for compiled deployment artifacts