diff --git a/.classpath b/.classpath index e3393aa..a079cbf 100644 --- a/.classpath +++ b/.classpath @@ -26,6 +26,13 @@ + + + + + + + diff --git a/README.md b/README.md index b775373..d6e3cde 100644 --- a/README.md +++ b/README.md @@ -2,10 +2,20 @@ # Radixor -*Fast algorithmic stemming with compact patch-command tries.* +*Fast algorithmic stemming with compact patch-command tries — measured at about 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.* **Radixor** is a fast, algorithmic stemming toolkit for Java, built around compact **patch-command tries** in the tradition of the original **Egothor** stemmer. +On the current JMH English comparison benchmark, Radixor with bundled `US_UK_PROFI` +reaches approximately **31 to 32 million tokens per second**, compared with about +**8 million tokens per second** for Snowball original Porter and about +**5 to 5.5 million tokens per second** for Snowball English (Porter2). + +That means the current Radixor implementation is approximately: + +- **4× faster** than Snowball original Porter +- **6× faster** than Snowball English (Porter2) + It is designed for production search and text-processing systems that need stemming which is: - fast at runtime @@ -22,6 +32,7 @@ Radixor keeps the valuable core of the original Egothor idea, modernizes the imp - [Heritage](#heritage) - [What Radixor adds](#what-radixor-adds) - [Key features](#key-features) +- [Performance](#performance) - [Documentation](#documentation) - [Project philosophy](#project-philosophy) - [Historical note](#historical-note) @@ -37,7 +48,7 @@ This gives you a stemmer that is: - compact enough for deployment-friendly binary artifacts - suitable for both offline compilation and runtime loading -Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer. 
+Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer. In the current English benchmark comparison against the Snowball Porter stemmer family, it also delivers a substantial throughput advantage. ## Heritage @@ -95,6 +106,27 @@ Compared with the historical baseline, Radixor emphasizes: - Bundled language resources - Support for extending compiled stemmer tables +## Performance + +Radixor includes a JMH benchmark suite for both its own algorithmic core and a +side-by-side comparison against the Snowball Porter stemmer family. + +On the current English comparison workload, Radixor with bundled `US_UK_PROFI` +reaches approximately **31 to 32 million tokens per second**. Snowball original +Porter reaches approximately **8 million tokens per second**, and Snowball +English (Porter2) approximately **5 to 5.5 million tokens per second**. + +That places Radixor at approximately **4× the throughput of Snowball original Porter** +and approximately **6× the throughput of Snowball English (Porter2)** +on the current benchmark workload. + +This is a throughput comparison on the same deterministic token stream. It is +not a claim that the compared stemmers are linguistically equivalent or +interchangeable. + +For benchmark scope, workload design, environment, commands, report locations, +and interpretation guidance, see [Benchmarking](docs/benchmarking.md). + ## Documentation The repository keeps the front page concise and places detailed documentation under `docs/`. @@ -122,6 +154,9 @@ Start here: - [Quality and Operations](docs/quality-and-operations.md) Testing, persistence, deployment, and operational guidance. +- [Benchmarking](docs/benchmarking.md) + JMH benchmark design, Snowball comparison, execution, and interpretation. + ## Project philosophy Radixor does not preserve historical complexity for its own sake. 
diff --git a/Radixor.png b/Radixor.png index 23ff4d0..0aad67b 100644 Binary files a/Radixor.png and b/Radixor.png differ diff --git a/build.gradle b/build.gradle index 4b40b17..fda5b23 100644 --- a/build.gradle +++ b/build.gradle @@ -125,7 +125,7 @@ jmh { tasks.named('jmh') { group = 'verification' - description = 'Runs JMH benchmarks for the Radixor algorithmic core.' + description = 'Runs JMH benchmarks for the Radixor algorithmic core and Snowball comparison suite.' } javadoc { @@ -154,6 +154,8 @@ javadoc { source = sourceSets.main.allJava } +apply from: 'gradle/snowball-benchmarks.gradle' + gradle.taskGraph.whenReady { taskGraph -> def banner = """ \u001B[34m diff --git a/docs/benchmarking.md b/docs/benchmarking.md new file mode 100644 index 0000000..9cf75e3 --- /dev/null +++ b/docs/benchmarking.md @@ -0,0 +1,134 @@ +# Benchmarking + +> ← Back to [README.md](../README.md) + +Radixor includes a JMH benchmark suite for both the internal algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family. + +This document explains what is benchmarked, how to run it, and how to interpret the results responsibly. + +## Scope + +The benchmark suite currently covers two categories: + +- Radixor core operations +- English stemmer comparison on the same token workload + +The comparison benchmark processes the same deterministic English token stream through: + +- Radixor with bundled `US_UK_PROFI` +- Snowball original Porter +- Snowball English, commonly referred to as Porter2 + +The purpose of the comparison is throughput measurement on identical input. It is not intended to prove linguistic equivalence between the compared stemmers. 
+ +## Current snapshot + +A recent JMH run on JDK 21.0.10 with JMH 1.37, one thread, three warmup iterations, and five measurement iterations produced the following approximate throughput ranges: + +| Workload | Radixor `US_UK_PROFI` | Snowball Porter | Snowball English | +| --- | ---: | ---: | ---: | +| About 12,000 generated tokens | 30.99 M tokens/s | 8.21 M tokens/s | 5.46 M tokens/s | +| About 60,000 generated tokens | 32.25 M tokens/s | 8.02 M tokens/s | 5.11 M tokens/s | + +On that workload, Radixor is approximately: + +- 4 times faster than Snowball original Porter +- 6 times faster than Snowball English + +These values are workload- and environment-dependent. Treat them as measured results for the documented benchmark setup, not as universal constants. + +## Benchmark classes + +The main benchmark classes are under `src/jmh/java/org/egothor/stemmer/benchmark`. + +Relevant classes include: + +- `FrequencyTrieLookupBenchmark` +- `FrequencyTrieCompilationBenchmark` +- `EnglishStemmerComparisonBenchmark` + +The English comparison benchmark uses the bundled Radixor English resource and the official Snowball Java distribution integrated into the JMH source set. + +## Workload design + +The English comparison benchmark uses a deterministic generated corpus rather than an uncontrolled ad hoc text sample. + +The workload intentionally mixes: + +- simple inflections +- common derivational forms +- US and UK spelling families +- lexical forms appropriate for `US_UK_PROFI` + +This design keeps runs reproducible across environments and avoids accidental drift caused by changing external corpora. 
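The workload design above can be sketched in a few lines of Java. This is a simplified stand-in rather than the project's generator: `CorpusSketch`, its three-word base list, and `family` are hypothetical names chosen for illustration, and the `-ize`/`-ise` cross-spelling branch is left out. It assumes the same scheme as the benchmark corpus, where each base word receives a base-36 ordinal and a fixed set of twelve suffix forms.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the corpus generator; not the project's class. */
final class CorpusSketch {

    // Hypothetical three-entry subset of the base list, for illustration only.
    private static final String[] BASES = { "analyze", "colour", "connect" };

    // Forms appended to every generated base, in a fixed order.
    private static final String[] SUFFIXES = { "", "s", "ed", "ing", "er", "ers",
            "ly", "ness", "ment", "ments", "able", "ability" };

    private CorpusSketch() {
        throw new AssertionError("No instances.");
    }

    /** Builds the tokens of one lexical family for the given ordinal. */
    static List<String> family(final int index) {
        if (index < 0) {
            throw new IllegalArgumentException("index must not be negative.");
        }
        // The ordinal is rendered in base 36 and appended to the base word,
        // so the same index always yields the same family.
        final String base = BASES[index % BASES.length]
                + Integer.toString(index, Character.MAX_RADIX);

        final List<String> tokens = new ArrayList<>(SUFFIXES.length);
        for (String suffix : SUFFIXES) {
            tokens.add(base + suffix);
        }
        return tokens;
    }

    public static void main(final String[] args) {
        // Two families: ordinal 0 and ordinal 37 ("11" in base 36).
        System.out.println(family(0));
        System.out.println(family(37));
    }
}
```

Because the ordinal is appended before the suffixes, the generated tokens are synthetic forms such as `analyze0ing` rather than dictionary words. Each family contributes twelve tokens, which is consistent with roughly 12,000 tokens for 1,000 families.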
+ +## Running benchmarks + +Run the full benchmark suite: + +```bash +./gradlew jmh +``` + +Run only the English comparison benchmark: + +```bash +./gradlew jmh -Pjmh.includes=EnglishStemmerComparisonBenchmark +``` + +## Generated reports + +JMH reports are written to: + +- `build/reports/jmh/jmh-results.txt` +- `build/reports/jmh/jmh-results.csv` + +The text report is convenient for human review. The CSV report is more useful for CI archiving, historical tracking, and external processing. + +## Interpreting results + +Benchmark numbers should be read with care. + +Important factors include: + +- CPU model and frequency behavior +- thermal throttling +- JVM vendor and version +- system background load +- operating-system scheduling noise +- benchmark parameter changes + +For meaningful comparison, keep these stable: + +- hardware or VM class +- JDK version +- benchmark parameters +- thread count +- benchmark source revision + +If a regression is suspected, repeat the run and compare against the previous CSV output rather than relying on a single measurement. + +## Regression tracking + +The recommended regression workflow is: + +1. archive `jmh-results.csv` +2. compare the same benchmark names across runs +3. compare only like-for-like environments +4. investigate sustained regressions rather than one-off noise + +For public reporting, the README should keep only the condensed benchmark summary, while detailed benchmark methodology and interpretation should remain in this document. + +## Notes on comparison fairness + +Radixor, Snowball Porter, and Snowball English are not the same kind of stemmer. + +Radixor uses a compiled patch-command trie driven by dictionary data. Snowball Porter and Snowball English are rule-based English stemmers. 
+ +Because of that, the comparison should be understood as: + +- equal input workload +- different stemming strategies +- measured throughput, not semantic identity + +That distinction matters whenever performance claims are discussed in documentation or release notes. diff --git a/gradle/snowball-benchmarks.gradle b/gradle/snowball-benchmarks.gradle new file mode 100644 index 0000000..8e93fd5 --- /dev/null +++ b/gradle/snowball-benchmarks.gradle @@ -0,0 +1,49 @@ +def snowballVersion = '3.0.1' +def snowballArchiveName = "libstemmer_java-${snowballVersion}.tar.gz" +def snowballDownloadUrl = "https://snowballstem.org/dist/${snowballArchiveName}" +def snowballDownloadFile = layout.buildDirectory.file("third-party/snowball/${snowballArchiveName}") +def snowballExtractDirectory = layout.buildDirectory.dir('third-party/snowball/source') +def snowballJavaSourceDirectory = layout.buildDirectory.dir( + "third-party/snowball/source/libstemmer_java-${snowballVersion}/java") + +tasks.register('downloadSnowballJava') { + group = 'build setup' + description = 'Downloads the official Snowball Java source distribution for benchmark-only use.' + + outputs.file(snowballDownloadFile) + + doLast { + File targetFile = snowballDownloadFile.get().asFile + targetFile.parentFile.mkdirs() + + if (!targetFile.exists()) { + new URL(snowballDownloadUrl).withInputStream { inputStream -> + targetFile.withOutputStream { outputStream -> + outputStream << inputStream + } + } + } + } +} + +tasks.register('extractSnowballJava', Copy) { + group = 'build setup' + description = 'Extracts the official Snowball Java source distribution.' 
+ + dependsOn(tasks.named('downloadSnowballJava')) + + from(tarTree(resources.gzip(snowballDownloadFile))) + into(snowballExtractDirectory) +} + +sourceSets { + jmh { + java { + srcDir(snowballJavaSourceDirectory) + } + } +} + +tasks.named('compileJmhJava') { + dependsOn(tasks.named('extractSnowballJava')) +} \ No newline at end of file diff --git a/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java new file mode 100644 index 0000000..67fa05d --- /dev/null +++ b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishComparisonCorpus.java @@ -0,0 +1,110 @@ +package org.egothor.stemmer.benchmark; + +import java.util.ArrayList; +import java.util.List; +import java.util.Locale; + +/** + * Builds a deterministic English token corpus for side-by-side stemming + * benchmarks. + * + *
<p>
+ * The generated corpus mixes:
+ * </p>
+ * <ul>
+ * <li>simple inflections</li>
+ * <li>common derivational forms</li>
+ * <li>US/UK spelling families</li>
+ * <li>forms that are suitable for comparison against the bundled
+ * {@code US_UK_PROFI} Radixor dictionary</li>
+ * </ul>
+ *
+ * <p>
+ * The goal is not to simulate natural language frequency distribution exactly,
+ * but to provide a stable and reproducible comparison workload for benchmark
+ * runs and regression tracking.
+ * </p>
+ */ +final class EnglishComparisonCorpus { + + /** + * Canonical lexical bases used to generate the token workload. + */ + private static final String[] BASES = { "analyze", "analyse", "color", "colour", "center", "centre", "organize", + "organise", "optimize", "optimise", "characterize", "characterise", "connect", "construct", "compute", + "design", "develop", "engineer", "govern", "improve", "index", "inform", "manage", "model", "observe", + "operate", "perform", "predict", "prepare", "process", "project", "protect", "publish", "query", "reduce", + "refresh", "render", "resolve", "return", "search", "select", "signal", "store", "structure", "support", + "transform", "update", "validate", "value" }; + + /** + * Utility class. + */ + private EnglishComparisonCorpus() { + throw new AssertionError("No instances."); + } + + /** + * Creates a deterministic token corpus for English stemming comparison. + * + * @param familyCount number of generated lexical families + * @return token array in stable order + */ + static String[] createTokens(final int familyCount) { + if (familyCount < 1) { + throw new IllegalArgumentException("familyCount must be at least 1."); + } + + final List<String> tokens = new ArrayList<>(familyCount * 14); + + for (int index = 0; index < familyCount; index++) { + final String base = createBase(index); + + tokens.add(base); + tokens.add(base + "s"); + tokens.add(base + "ed"); + tokens.add(base + "ing"); + tokens.add(base + "er"); + tokens.add(base + "ers"); + tokens.add(base + "ly"); + tokens.add(base + "ness"); + tokens.add(base + "ment"); + tokens.add(base + "ments"); + tokens.add(base + "able"); + tokens.add(base + "ability"); + + if (base.endsWith("ize")) { + tokens.add(base.substring(0, base.length() - 3) + "isation"); + tokens.add(base.substring(0, base.length() - 3) + "ised"); + } + + if (base.endsWith("ise")) { + tokens.add(base.substring(0, base.length() - 3) + "ization"); + tokens.add(base.substring(0, base.length() - 3) + "ized"); + } + } + + 
return tokens.toArray(String[]::new); + } + + /** + * Creates one deterministic base token. + * + * @param index base ordinal + * @return generated lexical base + */ + private static String createBase(final int index) { + return (BASES[index % BASES.length] + suffix(index)).toLowerCase(Locale.ROOT); + } + + /** + * Creates a compact discriminator suffix so that large corpora remain unique + * while retaining stable lexical families. + * + * @param value ordinal value + * @return compact discriminator + */ + private static String suffix(final int value) { + return Integer.toString(value, Character.MAX_RADIX); + } +} \ No newline at end of file diff --git a/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java new file mode 100644 index 0000000..27a6921 --- /dev/null +++ b/src/jmh/java/org/egothor/stemmer/benchmark/EnglishStemmerComparisonBenchmark.java @@ -0,0 +1,168 @@ +package org.egothor.stemmer.benchmark; + +import java.io.IOException; +import java.util.concurrent.TimeUnit; + +import org.egothor.stemmer.FrequencyTrie; +import org.egothor.stemmer.PatchCommandEncoder; +import org.egothor.stemmer.ReductionMode; +import org.egothor.stemmer.StemmerPatchTrieLoader; +import org.openjdk.jmh.annotations.Benchmark; +import org.openjdk.jmh.annotations.BenchmarkMode; +import org.openjdk.jmh.annotations.Level; +import org.openjdk.jmh.annotations.Measurement; +import org.openjdk.jmh.annotations.Mode; +import org.openjdk.jmh.annotations.OutputTimeUnit; +import org.openjdk.jmh.annotations.Param; +import org.openjdk.jmh.annotations.Scope; +import org.openjdk.jmh.annotations.Setup; +import org.openjdk.jmh.annotations.State; +import org.openjdk.jmh.annotations.Warmup; +import org.openjdk.jmh.infra.Blackhole; +import org.tartarus.snowball.ext.englishStemmer; +import org.tartarus.snowball.ext.porterStemmer; + +/** + * Compares English stemming throughput across Radixor and 
Snowball stemmers. + * + *
<p>
+ * The benchmark processes the same deterministic token array with:
+ * </p>
+ * <ul>
+ * <li>Radixor using bundled
+ * {@link StemmerPatchTrieLoader.Language#US_UK_PROFI}</li>
+ * <li>Snowball original Porter stemmer</li>
+ * <li>Snowball English stemmer, commonly referred to as Porter2</li>
+ * </ul>
+ *
+ * <p>
+ * This benchmark compares throughput on a shared workload. It does not imply
+ * that the algorithms are linguistically equivalent.
+ * </p>
+ */ +@BenchmarkMode(Mode.AverageTime) +@OutputTimeUnit(TimeUnit.NANOSECONDS) +@Warmup(iterations = 3, time = 1) +@Measurement(iterations = 5, time = 1) +public class EnglishStemmerComparisonBenchmark { + + /** + * Shared benchmark data. + */ + @State(Scope.Benchmark) + public static class SharedState { + + /** + * Number of generated lexical families. + */ + @Param({ "1000", "5000" }) + public int familyCount; + + /** + * Token workload processed by all compared stemmers. + */ + private String[] tokens; + + /** + * Radixor trie loaded from the bundled professional English dictionary. + */ + private FrequencyTrie<String> radixorTrie; + + /** + * Initializes the shared benchmark state. + * + * @throws IOException if the bundled Radixor dictionary cannot be loaded + */ + @Setup(Level.Trial) + public void setUp() throws IOException { + this.tokens = EnglishComparisonCorpus.createTokens(this.familyCount); + this.radixorTrie = StemmerPatchTrieLoader.load(StemmerPatchTrieLoader.Language.US_UK_PROFI, true, + ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS); + } + } + + /** + * Per-thread reusable Snowball stemmers. + */ + @State(Scope.Thread) + public static class SnowballState { + + /** + * Adapter for the original Porter stemmer. + */ + private SnowballStemmerAdapter porterStemmer; + + /** + * Adapter for the Snowball English stemmer. + */ + private SnowballStemmerAdapter englishStemmer; + + /** + * Initializes reusable Snowball stemmers for the executing thread. + */ + @Setup(Level.Trial) + public void setUp() { + this.porterStemmer = new SnowballStemmerAdapter(porterStemmer::new); + this.englishStemmer = new SnowballStemmerAdapter(englishStemmer::new); + } + } + + /** + * Measures Radixor preferred-result stemming throughput. 
+ * + * @param sharedState shared benchmark data + * @param blackhole sink preventing dead-code elimination + */ + @Benchmark + public void radixorUsUkProfiPreferredStem(final SharedState sharedState, final Blackhole blackhole) { + final String[] tokens = sharedState.tokens; + final FrequencyTrie<String> trie = sharedState.radixorTrie; + + for (String token : tokens) { + final String patch = trie.get(token); + final String stem = patch == null ? token : PatchCommandEncoder.apply(token, patch); + blackhole.consume(stem); + } + } + + /** + * Measures Snowball original Porter stemming throughput. + * + * @param sharedState shared benchmark data + * @param snowballState reusable Snowball stemmers + * @param blackhole sink preventing dead-code elimination + */ + @Benchmark + public void snowballOriginalPorter(final SharedState sharedState, final SnowballState snowballState, + final Blackhole blackhole) { + final String[] tokens = sharedState.tokens; + final SnowballStemmerAdapter stemmer = snowballState.porterStemmer; + + for (String token : tokens) { + blackhole.consume(stemmer.stem(token)); + } + } + + /** + * Measures Snowball English stemming throughput. + * + *
<p>
+ * Snowball English is the newer English stemmer commonly referred to as
+ * Porter2.
+ * </p>
+ * + * @param sharedState shared benchmark data + * @param snowballState reusable Snowball stemmers + * @param blackhole sink preventing dead-code elimination + */ + @Benchmark + public void snowballEnglishPorter2(final SharedState sharedState, final SnowballState snowballState, + final Blackhole blackhole) { + final String[] tokens = sharedState.tokens; + final SnowballStemmerAdapter stemmer = snowballState.englishStemmer; + + for (String token : tokens) { + blackhole.consume(stemmer.stem(token)); + } + } +} \ No newline at end of file diff --git a/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java b/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java new file mode 100644 index 0000000..0ff678c --- /dev/null +++ b/src/jmh/java/org/egothor/stemmer/benchmark/SnowballStemmerAdapter.java @@ -0,0 +1,57 @@ +package org.egothor.stemmer.benchmark; + +import java.util.Objects; + +import org.tartarus.snowball.SnowballStemmer; + +/** + * Small adapter around a Snowball stemmer instance used by benchmarks. + * + *
<p>
+ * The adapter keeps the benchmark code focused on the actual workload while
+ * still allowing a professional separation between benchmark orchestration and
+ * third-party stemming API calls.
+ * </p>
+ */ +final class SnowballStemmerAdapter { + + /** + * Factory of Snowball stemmer instances. + */ + @FunctionalInterface + interface Factory { + + /** + * Creates a new Snowball stemmer instance. + * + * @return new Snowball stemmer + */ + SnowballStemmer create(); + } + + /** + * Reusable Snowball stemmer instance. + */ + private final SnowballStemmer stemmer; + + /** + * Creates a new adapter. + * + * @param factory factory creating the concrete Snowball stemmer + */ + SnowballStemmerAdapter(final Factory factory) { + this.stemmer = Objects.requireNonNull(factory, "factory").create(); + } + + /** + * Applies stemming to the supplied token. + * + * @param token input token + * @return produced stem + */ + String stem(final String token) { + this.stemmer.setCurrent(token); + this.stemmer.stem(); + return this.stemmer.getCurrent(); + } +} \ No newline at end of file
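The benchmark methods in `EnglishStemmerComparisonBenchmark` report average time in nanoseconds per invocation, and one invocation stems the whole token array, while the documentation quotes tokens per second. A minimal sketch of the conversion between the two follows. It assumes the published figures were derived this way, which the change itself does not state; `ThroughputSketch`, `tokensPerSecond`, and the sample score are hypothetical.

```java
/** Illustrative helper; not part of the benchmark sources. */
final class ThroughputSketch {

    private static final double NANOS_PER_SECOND = 1_000_000_000.0;

    private ThroughputSketch() {
        throw new AssertionError("No instances.");
    }

    /**
     * Converts an average time per invocation into tokens per second.
     *
     * @param tokensPerInvocation tokens stemmed by one benchmark invocation
     * @param nanosPerInvocation  JMH score in ns/op
     * @return throughput in tokens per second
     */
    static double tokensPerSecond(final int tokensPerInvocation, final double nanosPerInvocation) {
        if (tokensPerInvocation < 1 || nanosPerInvocation <= 0.0) {
            throw new IllegalArgumentException("Both arguments must be positive.");
        }
        return tokensPerInvocation * NANOS_PER_SECOND / nanosPerInvocation;
    }

    public static void main(final String[] args) {
        // Hypothetical score: 12,000 tokens stemmed in 400,000 ns per invocation.
        final double throughput = tokensPerSecond(12_000, 400_000.0);
        System.out.printf("%.2f M tokens/s%n", throughput / 1_000_000.0);
    }
}
```

Dividing by the token count matters when comparing the two `familyCount` parameter values, since the raw ns/op score grows with the size of the token array even when per-token cost is unchanged.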