Radixor

Fast, deterministic, multi-language stemming for Java, built around compact patch-command tries and measured at roughly 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.

Radixor is a modern multi-language stemming toolkit for Java in the tradition of the original Egothor approach. It learns compact word-to-stem transformations from dictionary data, stores them in compiled patch-command tries, and exposes a runtime model designed for speed, determinism, and operational simplicity. Unlike a stemmer that only looks words up in a fixed dictionary, Radixor can also generalize beyond explicitly listed word forms.

It is particularly well suited to systems that need stemming that is:

  • fast at runtime,
  • compact in memory and on disk,
  • deterministic in behavior,
  • adaptable through dictionary data rather than hardcoded language rules,
  • practical to compile, persist, version, extend, and deploy.

It also retains the operational advantages of a compiled artifact model: predictable runtime behavior, direct binary loading, and clear separation between preparation-time compilation and live request processing.

Why Radixor

The central idea behind Radixor is simple: learn how to transform a word form into its stem, encode that transformation as a compact patch command, store it in a trie, and make the runtime path as small and direct as possible.

That produces a stemmer that is:

  • data-driven rather than rule-hardcoded,
  • applicable across languages through compiled transformation models learned from dictionary data,
  • compact enough for deployment-friendly binary artifacts,
  • suitable for both offline compilation and direct runtime loading,
  • capable of exposing either a preferred result or multiple candidate results when ambiguity matters.

Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer.
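
The central idea can be illustrated with a deliberately simplified sketch. It is not Radixor's actual command encoding or API: the class name PatchTrieSketch, the "cut:suffix" command format, and the forward-keyed exact-match trie are all assumptions made for illustration. The sketch only shows the general shape of the approach: derive a small patch from a (word, stem) pair, store it in a trie, and apply it at lookup time.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a simplified suffix patch, not Radixor's real command format. */
final class PatchTrieSketch {

    /** Hypothetical encoding "cut:suffix": drop `cut` trailing chars, then append `suffix`. */
    static String learn(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return (word.length() - common) + ":" + stem.substring(common);
    }

    /** Applies a "cut:suffix" command; returns the word unchanged if the cut does not fit. */
    static String apply(String word, String command) {
        int sep = command.indexOf(':');
        int cut = Integer.parseInt(command.substring(0, sep));
        if (cut > word.length()) {
            return word;
        }
        return word.substring(0, word.length() - cut) + command.substring(sep + 1);
    }

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String command; // null when no known word form ends at this node
    }

    private final Node root = new Node();

    /** Learns the patch for one dictionary pair and stores it under the word's path. */
    void add(String word, String stem) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.command = learn(word, stem);
    }

    /** Exact-match lookup: unknown words are returned unchanged in this sketch. */
    String stem(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return (node == null || node.command == null) ? word : apply(word, node.command);
    }

    public static void main(String[] args) {
        PatchTrieSketch trie = new PatchTrieSketch();
        trie.add("running", "run");
        trie.add("mice", "mouse");
        System.out.println(trie.stem("running"));
        System.out.println(trie.stem("mice"));
    }
}
```

In this sketch, "walking" to "walk" and "talking" to "talk" both produce the same command, "3:", which hints at why storing patches rather than stems can stay compact. Generalization to unseen word forms, which Radixor supports, is intentionally left out here.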

Performance

Radixor includes a JMH benchmark suite for both its own algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family.

On the current English comparison workload, Radixor with bundled US_UK reaches approximately 31 to 32 million tokens per second. Snowball original Porter reaches approximately 8 million tokens per second, and Snowball English (Porter2) approximately 5 to 5.5 million tokens per second.

That places Radixor at approximately:

  • 4× the throughput of Snowball original Porter
  • 6× the throughput of Snowball English (Porter2)

on the current benchmark workload.

This is a throughput comparison on the same deterministic token stream. It is not a claim that the compared stemmers are linguistically equivalent or interchangeable.
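
The multipliers follow directly from the approximate throughput figures quoted above. The short sketch below only redoes that arithmetic; the class and method names are hypothetical, and the inputs are the rounded numbers from this section rather than raw benchmark output.

```java
/** Illustrative arithmetic only; the inputs are the approximate figures quoted above. */
final class ThroughputRatioSketch {

    /** Returns how many times faster `candidate` is than `baseline` (same units). */
    static double ratio(double candidate, double baseline) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        // Approximate throughput in millions of tokens per second, taken from the text.
        double radixorLow = 31.0;
        double radixorHigh = 32.0;
        double porter = 8.0;
        double porter2Low = 5.0;
        double porter2High = 5.5;

        System.out.printf("vs Porter:  %.2fx to %.2fx%n",
                ratio(radixorLow, porter), ratio(radixorHigh, porter));
        // The low end pairs the slowest Radixor figure with the fastest Porter2 figure.
        System.out.printf("vs Porter2: %.2fx to %.2fx%n",
                ratio(radixorLow, porter2High), ratio(radixorHigh, porter2Low));
    }
}
```

With these inputs the ratios fall between roughly 3.9 and 4.0 against original Porter and between roughly 5.6 and 6.4 against Porter2, which is where the rounded 4× and 6× figures come from.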

For benchmark scope, workload design, environment, commands, report locations, and interpretation guidance, see Benchmarking.

Heritage

Radixor stands in the line of the original Egothor stemming work and its later Stempel packaging.

Historical Stempel documentation describes the stemmer code as taken virtually unchanged from the Egothor project, and Elasticsearch still documents the Stempel analysis plugin as integrating Lucene's Stempel module for Polish.

Useful historical references:

The Galambos paper is a useful historical reference for the semi-automatic, transformation-based stemming idea that later informed the Egothor lineage and, in turn, the conceptual background of Radixor. It should be read as research and heritage context rather than as a description of Radixor's present-day implementation.

Radixor is not a repackaging of legacy code. It is a modern implementation that preserves the valuable core idea while reworking the engineering around maintainability, testing, persistence, and long-term operational use.

What Radixor adds

Radixor keeps the patch-command trie model, but improves the engineering around it in ways that matter in real software systems.

Compared with the historical baseline, Radixor emphasizes:

  • a focused practical core
    The implementation concentrates on the parts of the original approach that are most useful in production.

  • immutable compiled tries
    Runtime lookup uses compact read-only structures optimized for efficient access.

  • support for more than one stemming result
    Radixor can expose both a preferred result and multiple candidate results when the underlying data is ambiguous.

  • frequency-aware deterministic ordering
    Candidate results are ordered consistently and reproducibly.

  • practical subtree reduction modes
    Reduction can be tuned toward stronger compression or more conservative semantic preservation.

  • reconstruction of writable builders from compiled artifacts
    Existing compiled stemmer tables can be reopened, modified, and compiled again.

  • strong validation discipline
    Coverage, mutation testing, benchmark visibility, and published reports are treated as part of the engineering standard rather than optional project decoration.
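
As one possible reading of "frequency-aware deterministic ordering", the sketch below orders candidate stems by descending frequency and breaks ties lexicographically so that the result does not depend on map iteration order. The class name, the tie-break rule, and the sample counts are assumptions for illustration; they are not taken from Radixor's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: one way to make candidate ordering reproducible. */
final class CandidateOrderingSketch {

    /**
     * Orders candidate stems by descending frequency.
     * Ties are broken by the stem's natural string order (an assumed rule),
     * so equal inputs always yield the same list.
     */
    static List<String> order(Map<String, Integer> frequencies) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(frequencies.entrySet());
        entries.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> ordered = new ArrayList<>(entries.size());
        for (Map.Entry<String, Integer> entry : entries) {
            ordered.add(entry.getKey());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical counts for the ambiguous form "axes".
        Map<String, Integer> candidates = Map.of("axe", 3, "ax", 3, "axis", 5);
        List<String> ordered = order(candidates);
        System.out.println("preferred: " + ordered.get(0));
        System.out.println("all: " + ordered);
    }
}
```

Under these assumptions the first element plays the role of the preferred result, and the full list corresponds to the multi-candidate view.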

Key features

  • Fast algorithmic stemming
  • Compact compiled binary artifacts
  • Patch-command based transformation model
  • Multi-language stemming through compiled transformation models
  • Single-result and multi-result lookup
  • Deterministic result ordering
  • Compressed binary persistence
  • Programmatic compilation and loading
  • CLI compilation tool
  • Bundled language resources
  • Support for extending compiled stemmer tables
  • Reproducible and auditable engineering posture

Documentation

The repository keeps the front page concise and places detailed documentation under docs/.

Getting Started

Programmatic Usage

Concepts and Internals

Dictionaries and Language Resources

Quality and Operations

  • Quality and Operations
    Engineering standards, validation posture, auditability, and operational model.

  • Benchmarking
    JMH benchmark methodology, Porter comparison, and result interpretation.

  • Published Reports
    Entry points to CI-published reports and GitHub Pages artifacts.

Project philosophy

Radixor does not preserve historical complexity for its own sake.

It preserves the valuable idea:

  • compact learned transformations,
  • trie-based lookup,
  • language-data-driven stemming,
  • practical runtime speed.

Then it improves the parts modern users care about:

  • maintainability,
  • testability,
  • modification workflows,
  • persistence,
  • determinism,
  • clearer APIs,
  • explicit quality evidence.

The goal is to keep the Egothor/Stempel lineage useful as a serious contemporary software component.

Historical note

Egothor showed that stemming could be both algorithmic and compact. Stempel proved that the approach was practical enough to survive inside major search ecosystems. Radixor continues that tradition with a modernized implementation focused on production use, maintainability, and controlled evolution.
