feat: add JMH comparison benchmarks for Radixor vs Snowball Porter stemmers

build: isolate Snowball benchmark integration into dedicated Gradle script
docs: highlight benchmarked throughput advantage in README
docs: add detailed benchmarking guide and execution notes
2026-04-14 18:25:41 +02:00
parent 85e33f2f60
commit 6b3559097a
9 changed files with 565 additions and 3 deletions
--- a/.classpath
+++ b/.classpath
@@ -26,6 +26,13 @@
 			<attribute name="test" value="true"/>
 		</attributes>
 	</classpathentry>
 	<classpathentry kind="src" output="bin/jmh" path="build/third-party/snowball/source/libstemmer_java-3.0.1/java">
 		<attributes>
 			<attribute name="gradle_scope" value="jmh"/>
 			<attribute name="gradle_used_by_scope" value="jmh"/>
 			<attribute name="test" value="true"/>
 		</attributes>
 	</classpathentry>
 	<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-21/"/>
 	<classpathentry kind="con" path="org.eclipse.buildship.core.gradleclasspathcontainer"/>
 	<classpathentry kind="output" path="bin/default"/>
--- a/README.md
+++ b/README.md
@@ -2,10 +2,20 @@
 # Radixor
-*Fast algorithmic stemming with compact patch-command tries.*
+*Fast algorithmic stemming with compact patch-command tries — measured at about 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.*
 **Radixor** is a fast, algorithmic stemming toolkit for Java, built around compact **patch-command tries** in the tradition of the original **Egothor** stemmer.
 On the current JMH English comparison benchmark, Radixor with bundled `US_UK_PROFI`
 reaches approximately **31 to 32 million tokens per second**, compared with about
 **8 million tokens per second** for Snowball original Porter and about
 **5 to 5.5 million tokens per second** for Snowball English (Porter2).
 That means the current Radixor implementation delivers approximately:
 - **4× the throughput** of Snowball original Porter
 - **6× the throughput** of Snowball English (Porter2)
 It is designed for production search and text-processing systems that need stemming which is:
 - fast at runtime
@@ -22,6 +32,7 @@ Radixor keeps the valuable core of the original Egothor idea, modernizes the imp
 - [Heritage](#heritage)
 - [What Radixor adds](#what-radixor-adds)
 - [Key features](#key-features)
 - [Performance](#performance)
 - [Documentation](#documentation)
 - [Project philosophy](#project-philosophy)
 - [Historical note](#historical-note)
@@ -37,7 +48,7 @@ This gives you a stemmer that is:
 - compact enough for deployment-friendly binary artifacts
 - suitable for both offline compilation and runtime loading
-Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer.
+Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer. In the current English benchmark comparison against the Snowball Porter stemmer family, it also delivers a substantial throughput advantage.
 ## Heritage
@@ -95,6 +106,27 @@ Compared with the historical baseline, Radixor emphasizes:
 - Bundled language resources
 - Support for extending compiled stemmer tables
 ## Performance
 Radixor includes a JMH benchmark suite for both its own algorithmic core and a
 side-by-side comparison against the Snowball Porter stemmer family.
 On the current English comparison workload, Radixor with bundled `US_UK_PROFI`
 reaches approximately **31 to 32 million tokens per second**. Snowball original
 Porter reaches approximately **8 million tokens per second**, and Snowball
 English (Porter2) approximately **5 to 5.5 million tokens per second**.
 That places Radixor at approximately **4× the throughput of Snowball original Porter**
 and approximately **6× the throughput of Snowball English (Porter2)**
 on the current benchmark workload.
 This is a throughput comparison on the same deterministic token stream. It is
 not a claim that the compared stemmers are linguistically equivalent or
 interchangeable.
 For benchmark scope, workload design, environment, commands, report locations,
 and interpretation guidance, see [Benchmarking](docs/benchmarking.md).
 ## Documentation
 The repository keeps the front page concise and places detailed documentation under `docs/`.
@@ -122,6 +154,9 @@ Start here:
 - [Quality and Operations](docs/quality-and-operations.md)  
  Testing, persistence, deployment, and operational guidance.
 - [Benchmarking](docs/benchmarking.md)  
  JMH benchmark design, Snowball comparison, execution, and interpretation.
 ## Project philosophy
 Radixor does not preserve historical complexity for its own sake.
--- a/Radixor.png
+++ b/Radixor.png
--- a/build.gradle
+++ b/build.gradle
@@ -125,7 +125,7 @@ jmh {
 tasks.named('jmh') {
    group = 'verification'
-    description = 'Runs JMH benchmarks for the Radixor algorithmic core.'
+    description = 'Runs JMH benchmarks for the Radixor algorithmic core and Snowball comparison suite.'
 }
 javadoc {
@@ -154,6 +154,8 @@ javadoc {
    source = sourceSets.main.allJava
 }
 apply from: 'gradle/snowball-benchmarks.gradle'
 gradle.taskGraph.whenReady { taskGraph ->
    def banner = """
 \u001B[34m
--- a/docs/benchmarking.md
+++ b/docs/benchmarking.md
@@ -0,0 +1,134 @@
 # Benchmarking
 > ← Back to [README.md](../README.md)
 Radixor includes a JMH benchmark suite for both the internal algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family.
 This document explains what is benchmarked, how to run it, and how to interpret the results responsibly.
 ## Scope
 The benchmark suite currently covers two categories:
 - Radixor core operations
 - English stemmer comparison on the same token workload
 The comparison benchmark processes the same deterministic English token stream through:
 - Radixor with bundled `US_UK_PROFI`
 - Snowball original Porter
 - Snowball English, commonly referred to as Porter2
 The purpose of the comparison is throughput measurement on identical input. It is not intended to prove linguistic equivalence between the compared stemmers.
 ## Current snapshot
 A recent JMH run on JDK 21.0.10 with JMH 1.37, one thread, three warmup iterations, and five measurement iterations produced the following approximate throughput figures:
 | Workload | Radixor `US_UK_PROFI` | Snowball Porter | Snowball English |
 | --- | ---: | ---: | ---: |
 | About 12,000 generated tokens | 30.99 M tokens/s | 8.21 M tokens/s | 5.46 M tokens/s |
 | About 60,000 generated tokens | 32.25 M tokens/s | 8.02 M tokens/s | 5.11 M tokens/s |
 On that workload, Radixor delivers approximately:
 - 4 times the throughput of Snowball original Porter
 - 6 times the throughput of Snowball English
 These values are workload- and environment-dependent. Treat them as measured results for the documented benchmark setup, not as universal constants.
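The comparison benchmark in this commit is annotated with `Mode.AverageTime` and a nanosecond output unit, so JMH reports time per invocation over the whole token array rather than tokens per second. The following is a minimal sketch of how such a score could be converted and how the rounded ratios follow from the table. It assumes one invocation processes every token exactly once. The class name, method names, and the 400,000 ns sample score are illustrative and not part of the repository.

```java
public final class ThroughputMath {

    private ThroughputMath() {
    }

    /** Converts an average time per invocation (in nanoseconds) into tokens per second. */
    public static double tokensPerSecond(final int tokensPerInvocation, final double nanosPerInvocation) {
        return tokensPerInvocation * 1_000_000_000.0 / nanosPerInvocation;
    }

    /** Ratio of two throughput figures expressed in the same unit. */
    public static double speedup(final double candidate, final double baseline) {
        return candidate / baseline;
    }

    public static void main(final String[] args) {
        // Hypothetical score: 400,000 ns for one pass over 12,000 tokens.
        System.out.printf("%.2f M tokens/s%n", tokensPerSecond(12_000, 400_000.0) / 1_000_000.0);

        // Ratios computed from the table values above (all in M tokens/s).
        System.out.printf("vs Snowball Porter:  %.2f and %.2f%n", speedup(30.99, 8.21), speedup(32.25, 8.02));
        System.out.printf("vs Snowball English: %.2f and %.2f%n", speedup(30.99, 5.46), speedup(32.25, 5.11));
    }
}
```

The per-row ratios land slightly below or above the rounded 4 and 6, which is why the summary is stated as approximate.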
 ## Benchmark classes
 The main benchmark classes are under `src/jmh/java/org/egothor/stemmer/benchmark`.
 Relevant classes include:
 - `FrequencyTrieLookupBenchmark`
 - `FrequencyTrieCompilationBenchmark`
 - `EnglishStemmerComparisonBenchmark`
 The English comparison benchmark uses the bundled Radixor English resource and the official Snowball Java distribution integrated into the JMH source set.
 ## Workload design
 The English comparison benchmark uses a deterministic generated corpus rather than an uncontrolled ad hoc text sample.
 The workload intentionally mixes:
 - simple inflections
 - common derivational forms
 - US and UK spelling families
 - lexical forms appropriate for `US_UK_PROFI`
 This design keeps runs reproducible across environments and avoids accidental drift caused by changing external corpora.
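A reduced sketch of the generation scheme may help when reading the numbers. It mirrors the approach of `EnglishComparisonCorpus` in this commit: a base word, a base-36 family ordinal, then a fixed suffix list. The three-entry base list, the shortened suffix list, and the class name `CorpusSketch` are assumptions made for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public final class CorpusSketch {

    // Hypothetical, shortened lists; the benchmark's own class uses more bases and suffixes.
    private static final String[] BASES = { "connect", "design", "index" };
    private static final String[] SUFFIXES = { "", "s", "ed", "ing" };

    private CorpusSketch() {
    }

    /** Builds the same token array for the same familyCount on every call. */
    public static String[] createTokens(final int familyCount) {
        final List<String> tokens = new ArrayList<>(familyCount * SUFFIXES.length);
        for (int index = 0; index < familyCount; index++) {
            // The base-36 ordinal keeps every family unique without any randomness.
            final String base = BASES[index % BASES.length] + Integer.toString(index, Character.MAX_RADIX);
            for (final String suffix : SUFFIXES) {
                tokens.add(base + suffix);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(final String[] args) {
        System.out.println(String.join(" ", createTokens(2)));
    }
}
```

With two families this prints `connect0 connect0s connect0ed connect0ing design1 design1s design1ed design1ing`. The ordinal is appended to the base before the inflectional suffix, so the generated tokens are synthetic word shapes rather than dictionary words.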
 ## Running benchmarks
 Run the full benchmark suite:
 ```bash
 ./gradlew jmh
 ```
 Run only the English comparison benchmark:
 ```bash
 ./gradlew jmh -Pjmh.includes=EnglishStemmerComparisonBenchmark
 ```
 ## Generated reports
 JMH reports are written to:
 - `build/reports/jmh/jmh-results.txt`
 - `build/reports/jmh/jmh-results.csv`
 The text report is convenient for human review. The CSV report is more useful for CI archiving, historical tracking, and external processing.
 ## Interpreting results
 Benchmark numbers should be read with care.
 Important factors include:
 - CPU model and frequency behavior
 - thermal throttling
 - JVM vendor and version
 - system background load
 - operating-system scheduling noise
 - benchmark parameter changes
 For meaningful comparison, keep these stable:
 - hardware or VM class
 - JDK version
 - benchmark parameters
 - thread count
 - benchmark source revision
 If a regression is suspected, repeat the run and compare against the previous CSV output rather than relying on a single measurement.
 ## Regression tracking
 The recommended regression workflow is:
 1. archive `jmh-results.csv`
 2. compare the same benchmark names across runs
 3. compare only like-for-like environments
 4. investigate sustained regressions rather than one-off noise
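The comparison step in this workflow can be sketched as a small standalone program. It assumes the JMH CSV layout with a quoted header row containing `Benchmark`, `Score`, and optional `Param: ...` columns, no commas inside cells, and `.` as the decimal separator. The class name `JmhCsvCompare` and the sample rows are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class JmhCsvCompare {

    private JmhCsvCompare() {
    }

    /** Parses CSV text into a map from "benchmark[param=value]" to score. */
    public static Map<String, Double> parseScores(final String csv) {
        final String[] lines = csv.strip().split("\\R");
        final String[] header = splitRow(lines[0]);
        int benchmarkColumn = -1;
        int scoreColumn = -1;
        for (int i = 0; i < header.length; i++) {
            if (header[i].equals("Benchmark")) {
                benchmarkColumn = i;
            } else if (header[i].equals("Score")) {
                scoreColumn = i;
            }
        }
        if (benchmarkColumn < 0 || scoreColumn < 0) {
            throw new IllegalArgumentException("Missing Benchmark or Score column.");
        }
        final Map<String, Double> scores = new LinkedHashMap<>();
        for (int row = 1; row < lines.length; row++) {
            final String[] cells = splitRow(lines[row]);
            final StringBuilder key = new StringBuilder(cells[benchmarkColumn]);
            for (int i = 0; i < header.length && i < cells.length; i++) {
                if (header[i].startsWith("Param:")) {
                    // Include parameters so that runs with different sizes are not mixed.
                    key.append('[').append(header[i].substring(6).trim()).append('=').append(cells[i]).append(']');
                }
            }
            scores.put(key.toString(), Double.parseDouble(cells[scoreColumn]));
        }
        return scores;
    }

    /** Relative change of the current score against the baseline, in percent. */
    public static double percentChange(final double baseline, final double current) {
        return (current - baseline) / baseline * 100.0;
    }

    private static String[] splitRow(final String line) {
        final String[] cells = line.split(",");
        for (int i = 0; i < cells.length; i++) {
            cells[i] = cells[i].trim().replace("\"", "");
        }
        return cells;
    }

    public static void main(final String[] args) {
        final String header = "\"Benchmark\",\"Mode\",\"Threads\",\"Samples\",\"Score\","
                + "\"Score Error (99.9%)\",\"Unit\",\"Param: familyCount\"\n";
        // Hypothetical rows standing in for two archived jmh-results.csv files.
        final String baseline = header + "\"demo.Bench.run\",\"avgt\",1,5,400000.0,1000.0,\"ns/op\",1000\n";
        final String current = header + "\"demo.Bench.run\",\"avgt\",1,5,440000.0,1000.0,\"ns/op\",1000\n";

        final Map<String, Double> before = parseScores(baseline);
        for (final Map.Entry<String, Double> entry : parseScores(current).entrySet()) {
            final Double old = before.get(entry.getKey());
            if (old != null) {
                // In average-time mode a positive change means the benchmark became slower.
                System.out.printf("%s: %+.1f%%%n", entry.getKey(), percentChange(old, entry.getValue()));
            }
        }
    }
}
```

Keying on both the benchmark name and its parameters keeps the comparison like-for-like. A change threshold for flagging regressions is left out on purpose, because a sensible value depends on the noise level of the environment.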
 For public reporting, the README should keep only the condensed benchmark summary, while detailed benchmark methodology and interpretation should remain in this document.
 ## Notes on comparison fairness
 Radixor, Snowball Porter, and Snowball English are not the same kind of stemmer.
 Radixor uses a compiled patch-command trie driven by dictionary data. Snowball Porter and Snowball English are rule-based English stemmers.
 Because of that, the comparison should be understood as:
 - equal input workload
 - different stemming strategies
 - measured throughput, not semantic identity
 That distinction matters whenever performance claims are discussed in documentation or release notes.
--- a/gradle/snowball-benchmarks.gradle
+++ b/gradle/snowball-benchmarks.gradle
@@ -0,0 +1,49 @@
 def snowballVersion = '3.0.1'
 def snowballArchiveName = "libstemmer_java-${snowballVersion}.tar.gz"
 def snowballDownloadUrl = "https://snowballstem.org/dist/${snowballArchiveName}"
 def snowballDownloadFile = layout.buildDirectory.file("third-party/snowball/${snowballArchiveName}")
 def snowballExtractDirectory = layout.buildDirectory.dir('third-party/snowball/source')
 def snowballJavaSourceDirectory = layout.buildDirectory.dir(
        "third-party/snowball/source/libstemmer_java-${snowballVersion}/java")
 tasks.register('downloadSnowballJava') {
    group = 'build setup'
    description = 'Downloads the official Snowball Java source distribution for benchmark-only use.'
    outputs.file(snowballDownloadFile)
    doLast {
        File targetFile = snowballDownloadFile.get().asFile
        targetFile.parentFile.mkdirs()
        if (!targetFile.exists()) {
            new URL(snowballDownloadUrl).withInputStream { inputStream ->
                targetFile.withOutputStream { outputStream ->
                    outputStream << inputStream
                }
            }
        }
    }
 }
 tasks.register('extractSnowballJava', Copy) {
    group = 'build setup'
    description = 'Extracts the official Snowball Java source distribution.'
    dependsOn(tasks.named('downloadSnowballJava'))
    from(tarTree(resources.gzip(snowballDownloadFile)))
    into(snowballExtractDirectory)
 }
 sourceSets {
    jmh {
        java {
            srcDir(snowballJavaSourceDirectory)
        }
    }
 }
 tasks.named('compileJmhJava') {
    dependsOn(tasks.named('extractSnowballJava'))
 }
--- a/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java
+++ b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java
@@ -0,0 +1,110 @@
 package org.egothor.stemmer.benchmark;
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Locale;
 /**
 * Builds a deterministic English token corpus for side-by-side stemming
 * benchmarks.
 *
 * <p>
 * The generated corpus mixes:
 * </p>
 * <ul>
 * <li>simple inflections</li>
 * <li>common derivational forms</li>
 * <li>US/UK spelling families</li>
 * <li>forms that are suitable for comparison against the bundled
 * {@code US_UK_PROFI} Radixor dictionary</li>
 * </ul>
 *
 * <p>
 * The goal is not to simulate natural language frequency distribution exactly,
 * but to provide a stable and reproducible comparison workload for benchmark
 * runs and regression tracking.
 * </p>
 */
 final class EnglishComparisonCorpus {
    /**
     * Canonical lexical bases used to generate the token workload.
     */
    private static final String[] BASES = { "analyze", "analyse", "color", "colour", "center", "centre", "organize",
            "organise", "optimize", "optimise", "characterize", "characterise", "connect", "construct", "compute",
            "design", "develop", "engineer", "govern", "improve", "index", "inform", "manage", "model", "observe",
            "operate", "perform", "predict", "prepare", "process", "project", "protect", "publish", "query", "reduce",
            "refresh", "render", "resolve", "return", "search", "select", "signal", "store", "structure", "support",
            "transform", "update", "validate", "value" };
    /**
     * Utility class.
     */
    private EnglishComparisonCorpus() {
        throw new AssertionError("No instances.");
    }
    /**
     * Creates a deterministic token corpus for English stemming comparison.
     *
     * @param familyCount number of generated lexical families
     * @return token array in stable order
     */
    static String[] createTokens(final int familyCount) {
        if (familyCount < 1) {
            throw new IllegalArgumentException("familyCount must be at least 1.");
        }
        final List<String> tokens = new ArrayList<>(familyCount * 14);
        for (int index = 0; index < familyCount; index++) {
            final String base = createBase(index);
            tokens.add(base);
            tokens.add(base + "s");
            tokens.add(base + "ed");
            tokens.add(base + "ing");
            tokens.add(base + "er");
            tokens.add(base + "ers");
            tokens.add(base + "ly");
            tokens.add(base + "ness");
            tokens.add(base + "ment");
            tokens.add(base + "ments");
            tokens.add(base + "able");
            tokens.add(base + "ability");
            if (base.endsWith("ize")) {
                tokens.add(base.substring(0, base.length() - 3) + "isation");
                tokens.add(base.substring(0, base.length() - 3) + "ised");
            }
            if (base.endsWith("ise")) {
                tokens.add(base.substring(0, base.length() - 3) + "ization");
                tokens.add(base.substring(0, base.length() - 3) + "ized");
            }
        }
        return tokens.toArray(String[]::new);
    }
    /**
     * Creates one deterministic base token.
     *
     * @param index base ordinal
     * @return generated lexical base
     */
    private static String createBase(final int index) {
        return (BASES[index % BASES.length] + suffix(index)).toLowerCase(Locale.ROOT);
    }
    /**
     * Creates a compact discriminator suffix so that large corpora remain unique
     * while retaining stable lexical families.
     *
     * @param value ordinal value
     * @return compact discriminator
     */
    private static String suffix(final int value) {
        return Integer.toString(value, Character.MAX_RADIX);
    }
 }
--- a/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java
+++ b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java
@@ -0,0 +1,168 @@
 package org.egothor.stemmer.benchmark;
 import java.io.IOException;
 import java.util.concurrent.TimeUnit;
 import org.egothor.stemmer.FrequencyTrie;
 import org.egothor.stemmer.PatchCommandEncoder;
 import org.egothor.stemmer.ReductionMode;
 import org.egothor.stemmer.StemmerPatchTrieLoader;
 import org.openjdk.jmh.annotations.Benchmark;
 import org.openjdk.jmh.annotations.BenchmarkMode;
 import org.openjdk.jmh.annotations.Level;
 import org.openjdk.jmh.annotations.Measurement;
 import org.openjdk.jmh.annotations.Mode;
 import org.openjdk.jmh.annotations.OutputTimeUnit;
 import org.openjdk.jmh.annotations.Param;
 import org.openjdk.jmh.annotations.Scope;
 import org.openjdk.jmh.annotations.Setup;
 import org.openjdk.jmh.annotations.State;
 import org.openjdk.jmh.annotations.Warmup;
 import org.openjdk.jmh.infra.Blackhole;
 import org.tartarus.snowball.ext.englishStemmer;
 import org.tartarus.snowball.ext.porterStemmer;
 /**
 * Compares English stemming throughput across Radixor and Snowball stemmers.
 *
 * <p>
 * The benchmark processes the same deterministic token array with:
 * </p>
 * <ul>
 * <li>Radixor using bundled
 * {@link StemmerPatchTrieLoader.Language#US_UK_PROFI}</li>
 * <li>Snowball original Porter stemmer</li>
 * <li>Snowball English stemmer, commonly referred to as Porter2</li>
 * </ul>
 *
 * <p>
 * This benchmark compares throughput on a shared workload. It does not imply
 * that the algorithms are linguistically equivalent.
 * </p>
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
 public class EnglishStemmerComparisonBenchmark {
    /**
     * Shared benchmark data.
     */
    @State(Scope.Benchmark)
    public static class SharedState {
        /**
         * Number of generated lexical families.
         */
        @Param({ "1000", "5000" })
        public int familyCount;
        /**
         * Token workload processed by all compared stemmers.
         */
        private String[] tokens;
        /**
         * Radixor trie loaded from the bundled professional English dictionary.
         */
        private FrequencyTrie<String> radixorTrie;
        /**
         * Initializes the shared benchmark state.
         *
         * @throws IOException if the bundled Radixor dictionary cannot be loaded
         */
        @Setup(Level.Trial)
        public void setUp() throws IOException {
            this.tokens = EnglishComparisonCorpus.createTokens(this.familyCount);
            this.radixorTrie = StemmerPatchTrieLoader.load(StemmerPatchTrieLoader.Language.US_UK_PROFI, true,
                    ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
        }
    }
    /**
     * Per-thread reusable Snowball stemmers.
     */
    @State(Scope.Thread)
    public static class SnowballState {
        /**
         * Adapter for the original Porter stemmer.
         */
        private SnowballStemmerAdapter porterStemmer;
        /**
         * Adapter for the Snowball English stemmer.
         */
        private SnowballStemmerAdapter englishStemmer;
        /**
         * Initializes reusable Snowball stemmers for the executing thread.
         */
        @Setup(Level.Trial)
        public void setUp() {
            this.porterStemmer = new SnowballStemmerAdapter(porterStemmer::new);
            this.englishStemmer = new SnowballStemmerAdapter(englishStemmer::new);
        }
    }
    /**
     * Measures Radixor preferred-result stemming throughput.
     *
     * @param sharedState shared benchmark data
     * @param blackhole   sink preventing dead-code elimination
     */
    @Benchmark
    public void radixorUsUkProfiPreferredStem(final SharedState sharedState, final Blackhole blackhole) {
        final String[] tokens = sharedState.tokens;
        final FrequencyTrie<String> trie = sharedState.radixorTrie;
        for (String token : tokens) {
            final String patch = trie.get(token);
            final String stem = patch == null ? token : PatchCommandEncoder.apply(token, patch);
            blackhole.consume(stem);
        }
    }
    /**
     * Measures Snowball original Porter stemming throughput.
     *
     * @param sharedState   shared benchmark data
     * @param snowballState reusable Snowball stemmers
     * @param blackhole     sink preventing dead-code elimination
     */
    @Benchmark
    public void snowballOriginalPorter(final SharedState sharedState, final SnowballState snowballState,
            final Blackhole blackhole) {
        final String[] tokens = sharedState.tokens;
        final SnowballStemmerAdapter stemmer = snowballState.porterStemmer;
        for (String token : tokens) {
            blackhole.consume(stemmer.stem(token));
        }
    }
    /**
     * Measures Snowball English stemming throughput.
     *
     * <p>
     * Snowball English is the newer English stemmer commonly referred to as
     * Porter2.
     * </p>
     *
     * @param sharedState   shared benchmark data
     * @param snowballState reusable Snowball stemmers
     * @param blackhole     sink preventing dead-code elimination
     */
    @Benchmark
    public void snowballEnglishPorter2(final SharedState sharedState, final SnowballState snowballState,
            final Blackhole blackhole) {
        final String[] tokens = sharedState.tokens;
        final SnowballStemmerAdapter stemmer = snowballState.englishStemmer;
        for (String token : tokens) {
            blackhole.consume(stemmer.stem(token));
        }
    }
 }
--- a/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java
+++ b/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java
@@ -0,0 +1,57 @@
 package org.egothor.stemmer.benchmark;
 import java.util.Objects;
 import org.tartarus.snowball.SnowballStemmer;
 /**
 * Small adapter around a Snowball stemmer instance used by benchmarks.
 *
 * <p>
 * The adapter keeps the benchmark code focused on the actual workload and
 * separates benchmark orchestration from third-party stemming API calls.
 * </p>
 */
 final class SnowballStemmerAdapter {
    /**
     * Factory of Snowball stemmer instances.
     */
    @FunctionalInterface
    interface Factory {
        /**
         * Creates a new Snowball stemmer instance.
         *
         * @return new Snowball stemmer
         */
        SnowballStemmer create();
    }
    /**
     * Reusable Snowball stemmer instance.
     */
    private final SnowballStemmer stemmer;
    /**
     * Creates a new adapter.
     *
     * @param factory factory creating the concrete Snowball stemmer
     */
    SnowballStemmerAdapter(final Factory factory) {
        this.stemmer = Objects.requireNonNull(factory, "factory").create();
    }
    /**
     * Applies stemming to the supplied token.
     *
     * @param token input token
     * @return produced stem
     */
    String stem(final String token) {
        this.stemmer.setCurrent(token);
        this.stemmer.stem();
        return this.stemmer.getCurrent();
    }
 }