Radixor/docs/built-in-languages.md

# Built-in Languages

Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom lexical data first.

These resources are intended as practical default dictionaries for common use. They provide a solid starting point for evaluation, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced by a custom dictionary.

## Overview

Bundled dictionaries are exposed through:

```java
StemmerPatchTrieLoader.Language
```

They are packaged with the library as text resources and compiled into a `FrequencyTrie<String>` when loaded.

## Supported languages

The following bundled language identifiers are currently available:

| Language | Enum constant | Notes |
|---|---|---|
| Danish | `DA_DK` | Bundled general-purpose dictionary |
| German | `DE_DE` | Bundled general-purpose dictionary |
| Spanish | `ES_ES` | Bundled general-purpose dictionary |
| French | `FR_FR` | Bundled general-purpose dictionary |
| Italian | `IT_IT` | Bundled general-purpose dictionary |
| Dutch | `NL_NL` | Bundled general-purpose dictionary |
| Norwegian | `NO_NO` | Bundled general-purpose dictionary |
| Portuguese | `PT_PT` | Bundled general-purpose dictionary |
| Russian | `RU_RU` | Currently supplied in normalized transliterated form |
| Swedish | `SV_SE` | Bundled general-purpose dictionary |
| English | `US_UK` | Standard English dictionary |
| English | `US_UK_PROFI` | Extended English dictionary |

## Basic usage

Load a bundled stemmer like this:

```java
import java.io.IOException;

import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BuiltInExample {

    private BuiltInExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
    }
}
```

The loader reads the bundled dictionary resource, parses the textual entries, derives patch-command mappings, and compiles the result into a read-only trie.

## Example: stemming with `US_UK_PROFI`

```java
import java.io.IOException;

import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.PatchCommandEncoder;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class EnglishExample {

    private EnglishExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

        final String word = "running";
        final String patch = trie.get(word);
        final String stem = PatchCommandEncoder.apply(word, patch);

        System.out.println(word + " -> " + stem);
    }
}
```

## `US_UK` and `US_UK_PROFI`

Radixor currently provides two bundled English variants.

### `US_UK`

`US_UK` is the lighter-weight bundled English resource. It is suitable where a smaller default dictionary is preferred and maximal lexical coverage is not the primary goal.

### `US_UK_PROFI`

`US_UK_PROFI` is the more extensive bundled English resource. It offers broader lexical coverage and is the better default for most applications that want stronger out-of-the-box behavior.

### Recommendation

For most English-language deployments, prefer:

```text
US_UK_PROFI
```

Use `US_UK` when a smaller bundled baseline is more appropriate.

## Intended role of bundled dictionaries

Bundled dictionaries should be understood as **general-purpose default resources**.

They are a good fit when:

- a supported language is already available,
- immediate usability matters,
- a reasonable baseline is sufficient,
- the goal is evaluation, prototyping, or straightforward integration.

They are also well suited to staged refinement workflows in which the bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.

## Character representation

The current bundled resources follow a pragmatic normalization convention.

At present, bundled dictionaries are supplied in normalized plain-ASCII form. For some languages, this is simply a lightweight maintenance convention. For others, especially languages commonly written in another script, it reflects a transliterated lexical resource. Russian is the clearest example in the current bundled set.

This convention belongs to the supplied dictionary resources, not to the core stemming model. The parser reads UTF-8 text, the dictionary model works with ordinary Java strings, and the trie and patch-command mechanism operate on general character sequences. In practical terms, the architecture is compatible with native-script dictionaries when suitable lexical resources are available.

## When to prefer custom dictionaries

A custom dictionary is usually the better choice when:

- domain-specific vocabulary materially affects stemming quality,
- lexical coverage must be controlled more precisely,
- a stronger language resource is available than the bundled baseline,
- native-script support is needed beyond the currently bundled resources.

Typical examples include:

- technical terminology,
- biomedical language,
- legal or financial vocabulary,
- organization-specific product and process names,
- language resources maintained in native scripts.

## Production recommendation

For production systems, the most robust workflow is usually:

1. start from a bundled dictionary when it is suitable,
2. extend it with domain-specific forms if needed,
3. compile or rebuild it into a binary `.radixor.gz` artifact,
4. deploy that compiled artifact,
5. load it at runtime using `loadBinary(...)`.

This avoids repeated startup parsing and makes the deployed stemming behavior explicit and versionable.

## Example refinement workflow

```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.FrequencyTrieBuilders;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.ReductionSettings;
import org.egothor.stemmer.StemmerPatchTrieBinaryIO;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class BundledRefinementExample {

    private BundledRefinementExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        final FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
                StemmerPatchTrieLoader.Language.US_UK_PROFI,
                true,
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

        final FrequencyTrie.Builder<String> builder = FrequencyTrieBuilders.copyOf(
                base,
                String[]::new,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));

        builder.put("microservices", "Na");

        final FrequencyTrie<String> compiled = builder.build();

        StemmerPatchTrieBinaryIO.write(compiled, Path.of("english-custom.radixor.gz"));
    }
}
```

## Extending language support

The built-in set is intentionally a practical baseline rather than a closed catalog. High-quality dictionaries for additional languages, improved language coverage, and stronger native-script resources are all natural extension paths for the project.

What matters most is not only the number of entries, but the quality, consistency, and operational usefulness of the lexical resource being added.

## Next steps

- [Quick start](quick-start.md)
- [Dictionary format](dictionary-format.md)
- [CLI compilation](cli-compilation.md)
- [Programmatic usage](programmatic-usage.md)

## Summary

Radixor’s built-in language support provides immediate usability, practical default dictionaries, and a strong starting point for custom refinement. The current bundled resources follow a pragmatic normalization convention, while the underlying architecture remains well suited to richer language resources and future extensions.