diff --git a/.classpath b/.classpath
index e3393aa..a079cbf 100644
--- a/.classpath
+++ b/.classpath
@@ -26,6 +26,13 @@
+
+
+
+
+
+
+
diff --git a/README.md b/README.md
index b775373..d6e3cde 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,20 @@
# Radixor
-*Fast algorithmic stemming with compact patch-command tries.*
+*Fast algorithmic stemming with compact patch-command tries — measured at about 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.*
**Radixor** is a fast, algorithmic stemming toolkit for Java, built around compact **patch-command tries** in the tradition of the original **Egothor** stemmer.
+On the current JMH English comparison benchmark, Radixor with bundled `US_UK_PROFI`
+reaches approximately **31 to 32 million tokens per second**, compared with about
+**8 million tokens per second** for Snowball original Porter and about
+**5 to 5.5 million tokens per second** for Snowball English (Porter2).
+
+That means the current Radixor implementation delivers approximately:
+
+- **4× the throughput** of Snowball original Porter
+- **6× the throughput** of Snowball English (Porter2)
+
It is designed for production search and text-processing systems that need stemming which is:
- fast at runtime
@@ -22,6 +32,7 @@ Radixor keeps the valuable core of the original Egothor idea, modernizes the imp
- [Heritage](#heritage)
- [What Radixor adds](#what-radixor-adds)
- [Key features](#key-features)
+- [Performance](#performance)
- [Documentation](#documentation)
- [Project philosophy](#project-philosophy)
- [Historical note](#historical-note)
@@ -37,7 +48,7 @@ This gives you a stemmer that is:
- compact enough for deployment-friendly binary artifacts
- suitable for both offline compilation and runtime loading
-Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer.
+Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer. In the current English benchmark comparison against the Snowball Porter stemmer family, it also delivers a substantial throughput advantage.
## Heritage
@@ -95,6 +106,27 @@ Compared with the historical baseline, Radixor emphasizes:
- Bundled language resources
- Support for extending compiled stemmer tables
+## Performance
+
+Radixor includes a JMH benchmark suite for both its own algorithmic core and a
+side-by-side comparison against the Snowball Porter stemmer family.
+
+On the current English comparison workload, Radixor with bundled `US_UK_PROFI`
+reaches approximately **31 to 32 million tokens per second**. Snowball original
+Porter reaches approximately **8 million tokens per second**, and Snowball
+English (Porter2) approximately **5 to 5.5 million tokens per second**.
+
+That places Radixor at approximately **4× the throughput of Snowball original Porter**
+and approximately **6× the throughput of Snowball English (Porter2)**
+on the current benchmark workload.
+
+This is a throughput comparison on the same deterministic token stream. It is
+not a claim that the compared stemmers are linguistically equivalent or
+interchangeable.
+
+For benchmark scope, workload design, environment, commands, report locations,
+and interpretation guidance, see [Benchmarking](docs/benchmarking.md).
+
## Documentation
The repository keeps the front page concise and places detailed documentation under `docs/`.
@@ -122,6 +154,9 @@ Start here:
- [Quality and Operations](docs/quality-and-operations.md)
Testing, persistence, deployment, and operational guidance.
+- [Benchmarking](docs/benchmarking.md)
+ JMH benchmark design, Snowball comparison, execution, and interpretation.
+
## Project philosophy
Radixor does not preserve historical complexity for its own sake.
diff --git a/Radixor.png b/Radixor.png
index 23ff4d0..0aad67b 100644
Binary files a/Radixor.png and b/Radixor.png differ
diff --git a/build.gradle b/build.gradle
index 4b40b17..fda5b23 100644
--- a/build.gradle
+++ b/build.gradle
@@ -125,7 +125,7 @@ jmh {
tasks.named('jmh') {
group = 'verification'
- description = 'Runs JMH benchmarks for the Radixor algorithmic core.'
+ description = 'Runs JMH benchmarks for the Radixor algorithmic core and Snowball comparison suite.'
}
javadoc {
@@ -154,6 +154,8 @@ javadoc {
source = sourceSets.main.allJava
}
+apply from: 'gradle/snowball-benchmarks.gradle'
+
gradle.taskGraph.whenReady { taskGraph ->
def banner = """
\u001B[34m
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
new file mode 100644
index 0000000..9cf75e3
--- /dev/null
+++ b/docs/benchmarking.md
@@ -0,0 +1,134 @@
+# Benchmarking
+
+> ← Back to [README.md](../README.md)
+
+Radixor includes a JMH benchmark suite for both the internal algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family.
+
+This document explains what is benchmarked, how to run it, and how to interpret the results responsibly.
+
+## Scope
+
+The benchmark suite currently covers two categories:
+
+- Radixor core operations
+- English stemmer comparison on the same token workload
+
+The comparison benchmark processes the same deterministic English token stream through:
+
+- Radixor with bundled `US_UK_PROFI`
+- Snowball original Porter
+- Snowball English, commonly referred to as Porter2
+
+The purpose of the comparison is throughput measurement on identical input. It is not intended to prove linguistic equivalence between the compared stemmers.
+
+## Current snapshot
+
+A recent JMH run on JDK 21.0.10 with JMH 1.37, one thread, three warmup iterations, and five measurement iterations produced the following approximate throughput figures:
+
+| Workload | Radixor `US_UK_PROFI` | Snowball Porter | Snowball English |
+| --- | ---: | ---: | ---: |
+| About 12,000 generated tokens | 30.99 M tokens/s | 8.21 M tokens/s | 5.46 M tokens/s |
+| About 60,000 generated tokens | 32.25 M tokens/s | 8.02 M tokens/s | 5.11 M tokens/s |
+
+On that workload, Radixor delivers approximately:
+
+- 4 times the throughput of Snowball original Porter (3.8× to 4.0× across the two rows)
+- 6 times the throughput of Snowball English (5.7× to 6.3× across the two rows)
+
+These values are workload- and environment-dependent. Treat them as measured results for the documented benchmark setup, not as universal constants.
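
The ratios above follow directly from the table. The snippet below is a minimal sketch of that arithmetic, not part of Radixor: the class `ThroughputMath` and its method names are hypothetical. It also shows how a tokens-per-second figure can be derived from an average time per invocation, assuming one invocation processes the whole token array.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Radixor: reproduces the ratio arithmetic behind the table. */
final class ThroughputMath {

    private ThroughputMath() {
    }

    /** Converts an average time per invocation (in nanoseconds) into tokens per second. */
    static double tokensPerSecond(final int tokensPerInvocation, final double nanosPerInvocation) {
        return tokensPerInvocation * 1_000_000_000.0 / nanosPerInvocation;
    }

    /** Returns how many times higher the candidate throughput is than the baseline. */
    static double speedup(final double candidate, final double baseline) {
        return candidate / baseline;
    }

    public static void main(final String[] args) {
        // Million tokens per second, copied from the table: Radixor, Snowball Porter, Snowball English.
        final double[][] rows = { { 30.99, 8.21, 5.46 }, { 32.25, 8.02, 5.11 } };
        for (double[] row : rows) {
            System.out.printf(Locale.ROOT, "vs Porter: %.2fx, vs English: %.2fx%n",
                    speedup(row[0], row[1]), speedup(row[0], row[2]));
        }
    }
}
```

Running `main` prints `3.77x` and `5.68x` for the first row and `4.02x` and `6.31x` for the second, which is where the rounded 4× and 6× summaries come from.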
+
+## Benchmark classes
+
+The main benchmark classes are under `src/jmh/java/org/egothor/stemmer/benchmark`.
+
+Relevant classes include:
+
+- `FrequencyTrieLookupBenchmark`
+- `FrequencyTrieCompilationBenchmark`
+- `EnglishStemmerComparisonBenchmark`
+
+The English comparison benchmark uses the bundled Radixor English resource and the official Snowball Java source distribution (version 3.0.1 in the current build script), which is downloaded at build time and added to the JMH source set.
+
+## Workload design
+
+The English comparison benchmark uses a deterministic generated corpus rather than an uncontrolled ad hoc text sample.
+
+The workload intentionally mixes:
+
+- simple inflections
+- common derivational forms
+- US and UK spelling families
+- lexical forms appropriate for `US_UK_PROFI`
+
+This design keeps runs reproducible across environments and avoids accidental drift caused by changing external corpora.
+
+## Running benchmarks
+
+Run the full benchmark suite:
+
+```bash
+./gradlew jmh
+```
+
+Run only the English comparison benchmark:
+
+```bash
+./gradlew jmh -Pjmh.includes=EnglishStemmerComparisonBenchmark
+```
+
+## Generated reports
+
+JMH reports are written to:
+
+- `build/reports/jmh/jmh-results.txt`
+- `build/reports/jmh/jmh-results.csv`
+
+The text report is convenient for human review. The CSV report is more useful for CI archiving, historical tracking, and external processing.
+
+## Interpreting results
+
+Benchmark numbers should be read with care.
+
+Important factors include:
+
+- CPU model and frequency behavior
+- thermal throttling
+- JVM vendor and version
+- system background load
+- operating-system scheduling noise
+- benchmark parameter changes
+
+For meaningful comparison, keep these stable:
+
+- hardware or VM class
+- JDK version
+- benchmark parameters
+- thread count
+- benchmark source revision
+
+If a regression is suspected, repeat the run and compare against the previous CSV output rather than relying on a single measurement.
+
+## Regression tracking
+
+The recommended regression workflow is:
+
+1. archive `jmh-results.csv`
+2. compare the same benchmark names across runs
+3. compare only like-for-like environments
+4. investigate sustained regressions rather than one-off noise
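
Steps 2 and 3 of the workflow above can be sketched in plain Java. This is a minimal sketch, not part of Radixor: `JmhCsvComparison`, its method names, and the 5% tolerance are hypothetical. It assumes the CSV has a header row with `Benchmark` and `Score` columns plus optional `Param: ...` columns, that no field contains a comma, and that scores are average times, so a larger score means a slower run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Radixor: compares scores from two JMH CSV reports. */
final class JmhCsvComparison {

    private JmhCsvComparison() {
    }

    /** Maps "benchmark name plus parameter values" to the reported score. */
    static Map<String, Double> parseScores(final List<String> csvLines) {
        final List<String> header = List.of(split(csvLines.get(0)));
        final int nameColumn = header.indexOf("Benchmark");
        final int scoreColumn = header.indexOf("Score");
        if (nameColumn < 0 || scoreColumn < 0) {
            throw new IllegalArgumentException("Missing Benchmark or Score column.");
        }

        final Map<String, Double> scores = new LinkedHashMap<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            if (line.isBlank()) {
                continue;
            }
            final String[] cells = split(line);
            final StringBuilder key = new StringBuilder(cells[nameColumn]);
            for (int column = 0; column < header.size(); column++) {
                if (header.get(column).startsWith("Param:")) {
                    key.append(' ').append(cells[column]);
                }
            }
            scores.put(key.toString(), Double.parseDouble(cells[scoreColumn]));
        }
        return scores;
    }

    /** Lists benchmarks whose average time grew by more than the given fraction. */
    static List<String> slowerThan(final Map<String, Double> baseline, final Map<String, Double> current,
            final double tolerance) {
        final List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Double> entry : current.entrySet()) {
            final Double previous = baseline.get(entry.getKey());
            if (previous != null && entry.getValue() > previous * (1.0 + tolerance)) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    /** Naive split: assumes no field contains a comma. */
    private static String[] split(final String line) {
        final String[] cells = line.split(",", -1);
        for (int index = 0; index < cells.length; index++) {
            cells[index] = cells[index].trim().replace("\"", "");
        }
        return cells;
    }

    public static void main(final String[] args) {
        final String header = "\"Benchmark\",\"Mode\",\"Score\",\"Unit\",\"Param: familyCount\"";
        // Illustrative scores only, not measured results.
        final List<String> baseline = List.of(header, "\"demo.stem\",\"avgt\",100.0,\"ns/op\",1000");
        final List<String> current = List.of(header, "\"demo.stem\",\"avgt\",120.0,\"ns/op\",1000");
        System.out.println(slowerThan(parseScores(baseline), parseScores(current), 0.05));
    }
}
```

Benchmarks that appear in only one of the two reports are ignored, which keeps the comparison like-for-like. In practice the lines would come from two archived `jmh-results.csv` files, for example via `Files.readAllLines`.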
+
+For public reporting, the README should keep only the condensed benchmark summary, while detailed benchmark methodology and interpretation should remain in this document.
+
+## Notes on comparison fairness
+
+Radixor, Snowball Porter, and Snowball English are not the same kind of stemmer.
+
+Radixor uses a compiled patch-command trie driven by dictionary data. Snowball Porter and Snowball English are rule-based English stemmers.
+
+Because of that, the comparison should be understood as:
+
+- equal input workload
+- different stemming strategies
+- measured throughput, not semantic identity
+
+That distinction matters whenever performance claims are discussed in documentation or release notes.
diff --git a/gradle/snowball-benchmarks.gradle b/gradle/snowball-benchmarks.gradle
new file mode 100644
index 0000000..8e93fd5
--- /dev/null
+++ b/gradle/snowball-benchmarks.gradle
@@ -0,0 +1,49 @@
+def snowballVersion = '3.0.1'
+def snowballArchiveName = "libstemmer_java-${snowballVersion}.tar.gz"
+def snowballDownloadUrl = "https://snowballstem.org/dist/${snowballArchiveName}"
+def snowballDownloadFile = layout.buildDirectory.file("third-party/snowball/${snowballArchiveName}")
+def snowballExtractDirectory = layout.buildDirectory.dir('third-party/snowball/source')
+def snowballJavaSourceDirectory = layout.buildDirectory.dir(
+ "third-party/snowball/source/libstemmer_java-${snowballVersion}/java")
+
+tasks.register('downloadSnowballJava') {
+ group = 'build setup'
+ description = 'Downloads the official Snowball Java source distribution for benchmark-only use.'
+
+ outputs.file(snowballDownloadFile)
+
+ doLast {
+ File targetFile = snowballDownloadFile.get().asFile
+ targetFile.parentFile.mkdirs()
+
+ if (!targetFile.exists()) {
+ new URL(snowballDownloadUrl).withInputStream { inputStream ->
+ targetFile.withOutputStream { outputStream ->
+ outputStream << inputStream
+ }
+ }
+ }
+ }
+}
+
+tasks.register('extractSnowballJava', Copy) {
+ group = 'build setup'
+ description = 'Extracts the official Snowball Java source distribution.'
+
+ dependsOn(tasks.named('downloadSnowballJava'))
+
+ from(tarTree(resources.gzip(snowballDownloadFile)))
+ into(snowballExtractDirectory)
+}
+
+sourceSets {
+ jmh {
+ java {
+ srcDir(snowballJavaSourceDirectory)
+ }
+ }
+}
+
+tasks.named('compileJmhJava') {
+ dependsOn(tasks.named('extractSnowballJava'))
+}
\ No newline at end of file
diff --git a/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java
new file mode 100644
index 0000000..67fa05d
--- /dev/null
+++ b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java
@@ -0,0 +1,110 @@
+package org.egothor.stemmer.benchmark;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Locale;
+
+/**
+ * Builds a deterministic English token corpus for side-by-side stemming
+ * benchmarks.
+ *
+ * <p>
+ * The generated corpus mixes:
+ * </p>
+ * <ul>
+ * <li>simple inflections</li>
+ * <li>common derivational forms</li>
+ * <li>US/UK spelling families</li>
+ * <li>forms that are suitable for comparison against the bundled
+ * {@code US_UK_PROFI} Radixor dictionary</li>
+ * </ul>
+ *
+ * <p>
+ * The goal is not to simulate natural language frequency distribution exactly,
+ * but to provide a stable and reproducible comparison workload for benchmark
+ * runs and regression tracking.
+ * </p>
+ */
+final class EnglishComparisonCorpus {
+
+ /**
+ * Canonical lexical bases used to generate the token workload.
+ */
+ private static final String[] BASES = { "analyze", "analyse", "color", "colour", "center", "centre", "organize",
+ "organise", "optimize", "optimise", "characterize", "characterise", "connect", "construct", "compute",
+ "design", "develop", "engineer", "govern", "improve", "index", "inform", "manage", "model", "observe",
+ "operate", "perform", "predict", "prepare", "process", "project", "protect", "publish", "query", "reduce",
+ "refresh", "render", "resolve", "return", "search", "select", "signal", "store", "structure", "support",
+ "transform", "update", "validate", "value" };
+
+ /**
+ * Utility class.
+ */
+ private EnglishComparisonCorpus() {
+ throw new AssertionError("No instances.");
+ }
+
+ /**
+ * Creates a deterministic token corpus for English stemming comparison.
+ *
+ * @param familyCount number of generated lexical families
+ * @return token array in stable order
+ */
+ static String[] createTokens(final int familyCount) {
+ if (familyCount < 1) {
+ throw new IllegalArgumentException("familyCount must be at least 1.");
+ }
+
+ final List<String> tokens = new ArrayList<>(familyCount * 14);
+
+ for (int index = 0; index < familyCount; index++) {
+ final String base = createBase(index);
+
+ tokens.add(base);
+ tokens.add(base + "s");
+ tokens.add(base + "ed");
+ tokens.add(base + "ing");
+ tokens.add(base + "er");
+ tokens.add(base + "ers");
+ tokens.add(base + "ly");
+ tokens.add(base + "ness");
+ tokens.add(base + "ment");
+ tokens.add(base + "ments");
+ tokens.add(base + "able");
+ tokens.add(base + "ability");
+
+ if (base.endsWith("ize")) {
+ tokens.add(base.substring(0, base.length() - 3) + "isation");
+ tokens.add(base.substring(0, base.length() - 3) + "ised");
+ }
+
+ if (base.endsWith("ise")) {
+ tokens.add(base.substring(0, base.length() - 3) + "ization");
+ tokens.add(base.substring(0, base.length() - 3) + "ized");
+ }
+ }
+
+ return tokens.toArray(String[]::new);
+ }
+
+ /**
+ * Creates one deterministic base token.
+ *
+ * @param index base ordinal
+ * @return generated lexical base
+ */
+ private static String createBase(final int index) {
+ return (BASES[index % BASES.length] + suffix(index)).toLowerCase(Locale.ROOT);
+ }
+
+ /**
+ * Creates a compact discriminator suffix so that large corpora remain unique
+ * while retaining stable lexical families.
+ *
+ * @param value ordinal value
+ * @return compact discriminator
+ */
+ private static String suffix(final int value) {
+ return Integer.toString(value, Character.MAX_RADIX);
+ }
+}
\ No newline at end of file
diff --git a/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java
new file mode 100644
index 0000000..27a6921
--- /dev/null
+++ b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java
@@ -0,0 +1,168 @@
+package org.egothor.stemmer.benchmark;
+
+import java.io.IOException;
+import java.util.concurrent.TimeUnit;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.PatchCommandEncoder;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+import org.openjdk.jmh.infra.Blackhole;
+import org.tartarus.snowball.ext.englishStemmer;
+import org.tartarus.snowball.ext.porterStemmer;
+
+/**
+ * Compares English stemming throughput across Radixor and Snowball stemmers.
+ *
+ * <p>
+ * The benchmark processes the same deterministic token array with:
+ * </p>
+ * <ul>
+ * <li>Radixor using bundled
+ * {@link StemmerPatchTrieLoader.Language#US_UK_PROFI}</li>
+ * <li>Snowball original Porter stemmer</li>
+ * <li>Snowball English stemmer, commonly referred to as Porter2</li>
+ * </ul>
+ *
+ * <p>
+ * This benchmark compares throughput on a shared workload. It does not imply
+ * that the algorithms are linguistically equivalent. JMH reports the average
+ * time per invocation in nanoseconds, where one invocation processes the
+ * whole token array.
+ * </p>
+ */
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+@Warmup(iterations = 3, time = 1)
+@Measurement(iterations = 5, time = 1)
+public class EnglishStemmerComparisonBenchmark {
+
+ /**
+ * Shared benchmark data.
+ */
+ @State(Scope.Benchmark)
+ public static class SharedState {
+
+ /**
+ * Number of generated lexical families.
+ */
+ @Param({ "1000", "5000" })
+ public int familyCount;
+
+ /**
+ * Token workload processed by all compared stemmers.
+ */
+ private String[] tokens;
+
+ /**
+ * Radixor trie loaded from the bundled professional English dictionary.
+ */
+ private FrequencyTrie<String> radixorTrie;
+
+ /**
+ * Initializes the shared benchmark state.
+ *
+ * @throws IOException if the bundled Radixor dictionary cannot be loaded
+ */
+ @Setup(Level.Trial)
+ public void setUp() throws IOException {
+ this.tokens = EnglishComparisonCorpus.createTokens(this.familyCount);
+ this.radixorTrie = StemmerPatchTrieLoader.load(StemmerPatchTrieLoader.Language.US_UK_PROFI, true,
+ ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+ }
+ }
+
+ /**
+ * Per-thread reusable Snowball stemmers.
+ */
+ @State(Scope.Thread)
+ public static class SnowballState {
+
+ /**
+ * Adapter for the original Porter stemmer.
+ */
+ private SnowballStemmerAdapter porterStemmer;
+
+ /**
+ * Adapter for the Snowball English stemmer.
+ */
+ private SnowballStemmerAdapter englishStemmer;
+
+ /**
+ * Initializes reusable Snowball stemmers for the executing thread.
+ */
+ @Setup(Level.Trial)
+ public void setUp() {
+ this.porterStemmer = new SnowballStemmerAdapter(porterStemmer::new);
+ this.englishStemmer = new SnowballStemmerAdapter(englishStemmer::new);
+ }
+ }
+
+ /**
+ * Measures Radixor preferred-result stemming throughput.
+ *
+ * @param sharedState shared benchmark data
+ * @param blackhole sink preventing dead-code elimination
+ */
+ @Benchmark
+ public void radixorUsUkProfiPreferredStem(final SharedState sharedState, final Blackhole blackhole) {
+ final String[] tokens = sharedState.tokens;
+ final FrequencyTrie<String> trie = sharedState.radixorTrie;
+
+ for (String token : tokens) {
+ final String patch = trie.get(token);
+ final String stem = patch == null ? token : PatchCommandEncoder.apply(token, patch);
+ blackhole.consume(stem);
+ }
+ }
+
+ /**
+ * Measures Snowball original Porter stemming throughput.
+ *
+ * @param sharedState shared benchmark data
+ * @param snowballState reusable Snowball stemmers
+ * @param blackhole sink preventing dead-code elimination
+ */
+ @Benchmark
+ public void snowballOriginalPorter(final SharedState sharedState, final SnowballState snowballState,
+ final Blackhole blackhole) {
+ final String[] tokens = sharedState.tokens;
+ final SnowballStemmerAdapter stemmer = snowballState.porterStemmer;
+
+ for (String token : tokens) {
+ blackhole.consume(stemmer.stem(token));
+ }
+ }
+
+ /**
+ * Measures Snowball English stemming throughput.
+ *
+ * <p>
+ * Snowball English is the newer English stemmer commonly referred to as
+ * Porter2.
+ * </p>
+ *
+ * @param sharedState shared benchmark data
+ * @param snowballState reusable Snowball stemmers
+ * @param blackhole sink preventing dead-code elimination
+ */
+ @Benchmark
+ public void snowballEnglishPorter2(final SharedState sharedState, final SnowballState snowballState,
+ final Blackhole blackhole) {
+ final String[] tokens = sharedState.tokens;
+ final SnowballStemmerAdapter stemmer = snowballState.englishStemmer;
+
+ for (String token : tokens) {
+ blackhole.consume(stemmer.stem(token));
+ }
+ }
+}
\ No newline at end of file
diff --git a/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java b/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java
new file mode 100644
index 0000000..0ff678c
--- /dev/null
+++ b/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java
@@ -0,0 +1,57 @@
+package org.egothor.stemmer.benchmark;
+
+import java.util.Objects;
+
+import org.tartarus.snowball.SnowballStemmer;
+
+/**
+ * Small adapter around a Snowball stemmer instance used by benchmarks.
+ *
+ * <p>
+ * The adapter keeps the benchmark code focused on the actual workload while
+ * still keeping a clean separation between benchmark orchestration and
+ * third-party stemming API calls.
+ * </p>
+ */
+final class SnowballStemmerAdapter {
+
+ /**
+ * Factory of Snowball stemmer instances.
+ */
+ @FunctionalInterface
+ interface Factory {
+
+ /**
+ * Creates a new Snowball stemmer instance.
+ *
+ * @return new Snowball stemmer
+ */
+ SnowballStemmer create();
+ }
+
+ /**
+ * Reusable Snowball stemmer instance.
+ */
+ private final SnowballStemmer stemmer;
+
+ /**
+ * Creates a new adapter.
+ *
+ * @param factory factory creating the concrete Snowball stemmer
+ */
+ SnowballStemmerAdapter(final Factory factory) {
+ this.stemmer = Objects.requireNonNull(factory, "factory").create();
+ }
+
+ /**
+ * Applies stemming to the supplied token.
+ *
+ * @param token input token
+ * @return produced stem
+ */
+ String stem(final String token) {
+ this.stemmer.setCurrent(token);
+ this.stemmer.stem();
+ return this.stemmer.getCurrent();
+ }
+}
\ No newline at end of file