docs: improve README, MkDocs content, branding assets, and site polish

2026-04-19 00:18:42 +02:00
parent db79dd2d4f
commit 0b674a39a8
19 changed files with 1836 additions and 1698 deletions

182
README.md

@@ -11,53 +11,61 @@
[![License](https://img.shields.io/github/license/leogalambos/Radixor)](LICENSE)
[![Java](https://img.shields.io/badge/Java-21%2B-brightgreen)](#)
*Fast algorithmic stemming with compact patch-command tries measured at about 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.*
*Fast, deterministic, multi-language stemming for Java, built around compact patch-command tries and measured at roughly 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.*
**Radixor** is a fast, algorithmic stemming toolkit for Java, built around compact **patch-command tries** in the tradition of the original **Egothor** stemmer.
**Radixor** is a modern multi-language stemming toolkit for Java in the tradition of the original **Egothor** approach. It learns compact word-to-stem transformations from dictionary data, stores them in compiled patch-command tries, and exposes a runtime model designed for speed, determinism, and operational simplicity. Unlike a stemmer that relies on pure dictionary lookup, Radixor can also generalize beyond explicitly listed word forms.
On the current JMH English comparison benchmark, Radixor with bundled `US_UK_PROFI`
reaches approximately **31 to 32 million tokens per second**, compared with about
**8 million tokens per second** for Snowball original Porter and about
**5 to 5.5 million tokens per second** for Snowball English (Porter2).
It is particularly well suited to systems that need stemming which is:
That means the current Radixor implementation is approximately:
- fast at runtime,
- compact in memory and on disk,
- deterministic in behavior,
- adaptable through dictionary data rather than hardcoded language rules,
- practical to compile, persist, version, extend, and deploy.
- **4× faster** than Snowball original Porter
- **6× faster** than Snowball English (Porter2)
It is designed for production search and text-processing systems that need stemming which is:
- fast at runtime
- compact in memory and on disk
- deterministic in behavior
- driven by dictionary data rather than hardcoded language rules
- practical to maintain, extend, and test
Radixor keeps the valuable core of the original Egothor idea, modernizes the implementation, and adds capabilities that make it more useful in real software systems today.
It also retains the operational advantages of a compiled artifact model: predictable runtime behavior, direct binary loading, and clear separation between preparation-time compilation and live request processing.
## Table of Contents
- [Why Radixor](#why-radixor)
- [Performance](#performance)
- [Heritage](#heritage)
- [What Radixor adds](#what-radixor-adds)
- [Key features](#key-features)
- [Performance](#performance)
- [Documentation](#documentation)
- [Project philosophy](#project-philosophy)
- [Historical note](#historical-note)
## Why Radixor
The central idea behind Radixor is simple: learn how to transform a word form into its stem, encode that transformation as a compact patch command, store it in a trie, and make runtime lookup extremely fast.
The central idea behind Radixor is simple: learn how to transform a word form into its stem, encode that transformation as a compact patch command, store it in a trie, and make the runtime path as small and direct as possible.
This gives you a stemmer that is:
That produces a stemmer that is:
- data-driven rather than rule-hardcoded
- reusable across languages
- compact enough for deployment-friendly binary artifacts
- suitable for both offline compilation and runtime loading
- data-driven rather than rule-hardcoded,
- applicable across languages through compiled transformation models learned from dictionary data,
- compact enough for deployment-friendly binary artifacts,
- suitable for both offline compilation and direct runtime loading,
- capable of exposing either a preferred result or multiple candidate results when ambiguity matters.
Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer. In the current English benchmark comparison against the Snowball Porter stemmer family, it also delivers a substantial throughput advantage.
Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer.
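The idea described in this section can be illustrated with a deliberately small sketch. It is not Radixor's API or its actual patch encoding: the class `PatchTrieSketch`, the `Patch` record (drop some trailing characters, then append a suffix), and the rule of keeping the first patch seen at each node are all assumptions made for illustration. The sketch only shows how a learned transformation stored in a trie keyed from the end of the word can also apply to words that were never listed.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; names and encoding are hypothetical, not Radixor's. */
public final class PatchTrieSketch {

    /** Hypothetical patch: drop {@code cut} trailing characters, then append {@code suffix}. */
    record Patch(int cut, String suffix) {
        String apply(String word) {
            return word.substring(0, word.length() - cut) + suffix;
        }
    }

    /** Trie node keyed by characters read from the END of the word. */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Patch patch; // first patch learned for this suffix path
    }

    private final Node root = new Node();

    /** Derives a patch from a (word, stem) pair using their common prefix. */
    static Patch learn(String word, String stem) {
        int common = 0;
        int limit = Math.min(word.length(), stem.length());
        while (common < limit && word.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return new Patch(word.length() - common, stem.substring(common));
    }

    /** Records the learned patch on every node along the reversed word (first one wins). */
    void add(String word, String stem) {
        Patch patch = learn(word, stem);
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
            if (node.patch == null) {
                node.patch = patch;
            }
        }
    }

    /** Follows the longest matching suffix path and applies the patch found there. */
    String stem(String word) {
        Node node = root;
        Patch best = null;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                break;
            }
            best = node.patch;
        }
        // Fall back to the input when nothing matches or the patch would not fit.
        return (best == null || best.cut() > word.length()) ? word : best.apply(word);
    }
}
```

After adding only `walking -> walk`, this sketch also maps the unlisted form `talking` to `talk`, because both words share the suffix path on which the patch was stored. A real implementation would need a far more careful patch encoding, ambiguity handling, and trie reduction than this toy version has.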
## Performance
Radixor includes a JMH benchmark suite for both its own algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family.
On the current English comparison workload, Radixor with bundled `US_UK_PROFI` reaches approximately **31 to 32 million tokens per second**. Snowball original Porter reaches approximately **8 million tokens per second**, and Snowball English (Porter2) approximately **5 to 5.5 million tokens per second**.
That places Radixor at approximately:
- **4× the throughput of Snowball original Porter**
- **6× the throughput of Snowball English (Porter2)**
on the current benchmark workload.
This is a throughput comparison on the same deterministic token stream. It is **not** a claim that the compared stemmers are linguistically equivalent or interchangeable.
For benchmark scope, workload design, environment, commands, report locations, and interpretation guidance, see [Benchmarking](docs/benchmarking.md).
## Heritage
@@ -69,44 +77,47 @@ Useful historical references:
- [Egothor project](http://www.egothor.org/)
- [Stempel overview](https://www.getopt.org/stempel/)
- [Leo Galambos, *Lemmatizer for Document Information Retrieval Systems in JAVA* (SOFSEM 2001)](https://www.researchgate.net/publication/221512865_Lemmatizer_for_Document_Information_Retrieval_Systems_in_JAVA)
- [Lucene Stempel overview](https://lucene.apache.org/core/5_3_0/analyzers-stempel/index.html)
- [Elasticsearch Stempel plugin](https://www.elastic.co/docs/reference/elasticsearch/plugins/analysis-stempel)
Radixor is not just a repackaging of legacy code. It is a practical modernization of the approach for current Java development and long-term maintainability.
The Galambos paper is a useful historical reference for the semi-automatic, transformation-based stemming idea that later informed the Egothor lineage and, in turn, the conceptual background of Radixor. It should be read as research and heritage context rather than as a description of Radixor's present-day implementation.
Radixor is not a repackaging of legacy code. It is a modern implementation that preserves the valuable core idea while reworking the engineering around maintainability, testing, persistence, and long-term operational use.
## What Radixor adds
Radixor keeps the patch-command trie model, but improves the engineering around it.
Radixor keeps the patch-command trie model, but improves the engineering around it in ways that matter in real software systems.
Compared with the historical baseline, Radixor emphasizes:
- **simplification to the most practical core**
The implementation focuses on the parts of the original approach that are most useful in production.
- **a focused practical core**
The implementation concentrates on the parts of the original approach that are most useful in production.
- **immutable compiled tries**
Runtime lookup uses compact read-only structures optimized for efficient access.
- **support for more than one stemming result**
Radixor can expose both a preferred result and multiple candidate results where the data is ambiguous.
Radixor can expose both a preferred result and multiple candidate results when the underlying data is ambiguous.
- **frequency-aware deterministic ordering**
Candidate results are ordered consistently and reproducibly.
- **practical subtree reduction modes**
Reduction can be tuned toward stronger compression or more conservative behavioral preservation.
Reduction can be tuned toward stronger compression or more conservative semantic preservation.
- **reconstruction of writable builders from compiled tables**
- **reconstruction of writable builders from compiled artifacts**
Existing compiled stemmer tables can be reopened, modified, and compiled again.
- **better tests and implementation stability**
Stronger coverage improves confidence during refactoring and further development.
- **strong validation discipline**
Coverage, mutation testing, benchmark visibility, and published reports are treated as part of the engineering standard rather than optional project decoration.
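As a rough illustration of what frequency-aware deterministic ordering of multiple candidates can mean in practice, the sketch below sorts candidate stems by descending frequency and breaks ties lexicographically. The `Candidate` record, the `order` method, and the tie-breaking rule are assumptions made for this example; the README does not specify how Radixor orders equal-frequency candidates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not Radixor's API or its actual ordering rule. */
public final class CandidateOrderingSketch {

    /** Hypothetical candidate: a stem plus how often it was observed in the dictionary data. */
    record Candidate(String stem, int frequency) {}

    /** Orders by frequency (highest first), then by stem text so ties are reproducible. */
    static List<String> order(List<Candidate> candidates) {
        Comparator<Candidate> ordering = Comparator
                .comparingInt(Candidate::frequency).reversed()
                .thenComparing(Candidate::stem);
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(ordering);
        return sorted.stream().map(Candidate::stem).toList();
    }
}
```

Because the comparator is a total order over distinct stems, the result does not depend on the order in which the candidates were supplied. The first element plays the role of a preferred result and the full list plays the role of the candidate set.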
## Key features
- Fast algorithmic stemming
- Compact compiled binary artifacts
- Patch-command based transformation model
- Dictionary-driven language adaptation
- Multi-language stemming through compiled transformation models
- Single-result and multi-result lookup
- Deterministic result ordering
- Compressed binary persistence
@@ -114,57 +125,69 @@ Compared with the historical baseline, Radixor emphasizes:
- CLI compilation tool
- Bundled language resources
- Support for extending compiled stemmer tables
## Performance
Radixor includes a JMH benchmark suite for both its own algorithmic core and a
side-by-side comparison against the Snowball Porter stemmer family.
On the current English comparison workload, Radixor with bundled `US_UK_PROFI`
reaches approximately **31 to 32 million tokens per second**. Snowball original
Porter reaches approximately **8 million tokens per second**, and Snowball
English (Porter2) approximately **5 to 5.5 million tokens per second**.
That places Radixor at approximately **4× the throughput of Snowball original Porter**
and approximately **6× the throughput of Snowball English (Porter2)**
on the current benchmark workload.
This is a throughput comparison on the same deterministic token stream. It is
not a claim that the compared stemmers are linguistically equivalent or
interchangeable.
For benchmark scope, workload design, environment, commands, report locations,
and interpretation guidance, see [Benchmarking](docs/benchmarking.md).
- Reproducible and auditable engineering posture
## Documentation
The repository keeps the front page concise and places detailed documentation under `docs/`.
Start here:
### Getting Started
- [Quick Start](docs/quick-start.md)
A practical first guide to loading, compiling, and using Radixor.
- [Built-in Languages](docs/built-in-languages.md)
Overview of bundled language resources such as `US_UK` and `US_UK_PROFI`.
- [Dictionary Format](docs/dictionary-format.md)
How to write stemming dictionaries.
How to write and normalize stemming dictionaries.
- [Compilation (CLI tool)](docs/cli-compilation.md)
How to compile dictionaries with the `Compile` CLI.
How to compile dictionaries into deployable binary artifacts.
- [Programmatic Usage](docs/programmatic-usage.md)
How to build, load, modify, and query Radixor from Java code.
### Programmatic Usage
- [Built-in Languages](docs/built-in-languages.md)
How to use integrated language resources such as `US_UK_PROFI`.
- [Programmatic Usage Overview](docs/programmatic-usage.md)
Entry point to the Java API and the overall usage model.
- [Architecture and Reduction](docs/architecture-and-reduction.md)
Internal model, compiled trie design, and reduction strategies.
- [Loading and Building Stemmers](docs/programmatic-loading-and-building.md)
Loading bundled resources, textual dictionaries, binary artifacts, and direct builder usage.
- [Querying and Ambiguity Handling](docs/programmatic-querying-and-ambiguity.md)
`get()`, `getAll()`, `getEntries()`, patch application, and ambiguity behavior.
- [Extending and Persisting Compiled Tries](docs/programmatic-extending-and-persistence.md)
Reopening compiled tries, rebuilding them, and writing binary artifacts.
### Concepts and Internals
- [Architecture and Reduction Overview](docs/architecture-and-reduction.md)
High-level explanation of the build pipeline and compiled trie model.
- [Architecture](docs/architecture.md)
Structural model, data flow, and runtime lookup behavior.
- [Reduction Semantics](docs/reduction-semantics.md)
Ranked, unordered, and dominant reduction behavior.
- [Compatibility and Guarantees](docs/compatibility-and-guarantees.md)
Supported public API, internal API boundaries, and compatibility expectations.
### Dictionaries and Language Resources
- [Contributing Dictionaries](docs/contributing-dictionaries.md)
Guidance for high-quality lexical resource contributions.
### Quality and Operations
- [Quality and Operations](docs/quality-and-operations.md)
Testing, persistence, deployment, and operational guidance.
Engineering standards, validation posture, auditability, and operational model.
- [Benchmarking](docs/benchmarking.md)
JMH benchmark design, Snowball comparison, execution, and interpretation.
JMH benchmark methodology, Porter comparison, and result interpretation.
- [Published Reports](docs/reports.md)
Entry points to CI-published reports and GitHub Pages artifacts.
## Project philosophy
@@ -172,19 +195,20 @@ Radixor does not preserve historical complexity for its own sake.
It preserves the valuable idea:
- compact learned transformations
- trie-based lookup
- language-data driven stemming
- practical runtime speed
- compact learned transformations,
- trie-based lookup,
- language-data driven stemming,
- practical runtime speed.
Then it improves the parts modern users care about:
- maintainability
- testability
- modification workflows
- persistence
- determinism
- clearer APIs
- maintainability,
- testability,
- modification workflows,
- persistence,
- determinism,
- clearer APIs,
- explicit quality evidence.
The goal is to keep the Egothor/Stempel lineage useful as a serious contemporary software component.