Files
Radixor/docs/programmatic-usage.md

6.5 KiB
Raw Blame History

Programmatic Usage

This document describes how to use Radixor programmatically from Java.

It covers:

  • building a trie from dictionary data
  • compiling it into an immutable structure
  • loading compiled stemmers
  • querying for stems
  • working with multiple candidates
  • modifying existing compiled stemmers

Overview

Radixor separates the stemming lifecycle into three stages:

  1. Build collect wordstem mappings in a mutable structure
  2. Compile reduce and convert to an immutable trie
  3. Query perform fast runtime lookups

These stages are represented by:

  • FrequencyTrie.Builder (mutable)
  • FrequencyTrie (immutable, compiled)
  • StemmerPatchTrieLoader / StemmerPatchTrieBinaryIO (I/O)

Building a trie programmatically

You can construct a trie directly without using the CLI.

import org.egothor.stemmer.*;

public final class BuildExample {

    public static void main(String[] args) {
        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        FrequencyTrie.Builder<String> builder =
                new FrequencyTrie.Builder<>(String[]::new, settings);

        PatchCommandEncoder encoder = new PatchCommandEncoder();

        builder.put("running", encoder.encode("running", "run"));
        builder.put("runs", encoder.encode("runs", "run"));
        builder.put("ran", encoder.encode("ran", "run"));

        FrequencyTrie<String> trie = builder.build();
    }
}

Loading from dictionary files

To parse dictionary files directly:

import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class LoadFromDictionaryExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"),
                true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );
    }
}

Loading a compiled binary trie

import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class LoadBinaryExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie =
                StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
    }
}

This is the preferred production approach.

Querying for stems

Preferred result

String word = "running";
String patch = trie.get(word);
String stem = PatchCommandEncoder.apply(word, patch);

All candidates

String[] patches = trie.getAll(word);

for (String patch : patches) {
    String stem = PatchCommandEncoder.apply(word, patch);
}

Accessing value frequencies

For diagnostic or advanced use cases:

import org.egothor.stemmer.ValueCount;

java.util.List<ValueCount<String>> entries = trie.getEntries("axes");

for (ValueCount<String> entry : entries) {
    String patch = entry.value();
    int count = entry.count();
}

This allows:

  • inspecting ambiguity
  • understanding ranking decisions
  • debugging dictionary quality

Using bundled language resources

FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
        StemmerPatchTrieLoader.Language.US_UK_PROFI,
        true,
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);

Bundled dictionaries are useful for:

  • quick integration
  • testing
  • reference behavior

Persisting a compiled trie

import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class SaveExample {

    public static void main(String[] args) throws IOException {
        StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
    }
}

Modifying an existing trie

A compiled trie can be reopened into a builder, extended, and rebuilt.

import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class ModifyExample {

    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> compiled =
                StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));

        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        FrequencyTrie.Builder<String> builder =
                FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);

        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

        FrequencyTrie<String> updated = builder.build();

        StemmerPatchTrieBinaryIO.write(updated,
                Path.of("english-custom.radixor.gz"));
    }
}

Thread safety

  • FrequencyTrie (compiled):

    • thread-safe
    • safe for concurrent reads
  • FrequencyTrie.Builder:

    • not thread-safe
    • intended for single-threaded construction

Performance characteristics

Querying

  • O(length of word)
  • minimal allocations
  • suitable for high-throughput pipelines

Loading

  • binary loading is fast
  • no preprocessing required

Building

  • depends on dictionary size
  • reduction phase may be CPU-intensive

Best practices

Reuse compiled trie instances

  • load once
  • share across threads

Prefer binary loading in production

  • avoid rebuilding at runtime
  • treat compiled files as deployable artifacts

Use getAll() only when needed

  • get() is faster and sufficient for most use cases

Keep builders short-lived

  • build → compile → discard

Integration patterns

Search systems

  • apply stemming during indexing and querying
  • ensure consistent dictionary usage

Text normalization pipelines

  • integrate as a transformation step
  • combine with tokenization and filtering

Domain adaptation

  • extend dictionaries with domain-specific vocabulary
  • rebuild compiled artifacts

Next steps

Summary

Programmatic usage of Radixor follows a clear pattern:

  • build or load a trie
  • query using patch commands
  • apply transformations

The API is intentionally simple at the surface, while providing deeper control when needed for:

  • ambiguity handling
  • diagnostics
  • dictionary evolution