feat: add JMH comparison benchmarks for Radixor vs Snowball Porter stemmers
build: isolate Snowball benchmark integration into dedicated Gradle script
docs: highlight benchmarked throughput advantage in README
docs: add detailed benchmarking guide and execution notes
@@ -26,6 +26,13 @@
|
|||||||
<attribute name="test" value="true"/>
|
<attribute name="test" value="true"/>
|
||||||
</attributes>
|
</attributes>
|
||||||
</classpathentry>
|
</classpathentry>
|
||||||
|
<classpathentry kind="src" output="bin/jmh" path="build/third-party/snowball/source/libstemmer_java-3.0.1/java">
|
||||||
|
<attributes>
|
||||||
|
<attribute name="gradle_scope" value="jmh"/>
|
||||||
|
<attribute name="gradle_used_by_scope" value="jmh"/>
|
||||||
|
<attribute name="test" value="true"/>
|
||||||
|
</attributes>
|
||||||
|
</classpathentry>
|
||||||
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-21/"/>
|
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-21/"/>
|
||||||
<classpathentry kind="con" path="org.eclipse.buildship.core.gradleclasspathcontainer"/>
|
<classpathentry kind="con" path="org.eclipse.buildship.core.gradleclasspathcontainer"/>
|
||||||
<classpathentry kind="output" path="bin/default"/>
|
<classpathentry kind="output" path="bin/default"/>
|
||||||
|
|||||||
39
README.md
@@ -2,10 +2,20 @@
|
|||||||
|
|
||||||
# Radixor
|
# Radixor
|
||||||
|
|
||||||
*Fast algorithmic stemming with compact patch-command tries.*
|
*Fast algorithmic stemming with compact patch-command tries — measured at about 4× to 6× the throughput of the Snowball Porter stemmer family on the current English benchmark workload.*
|
||||||
|
|
||||||
**Radixor** is a fast, algorithmic stemming toolkit for Java, built around compact **patch-command tries** in the tradition of the original **Egothor** stemmer.
|
**Radixor** is a fast, algorithmic stemming toolkit for Java, built around compact **patch-command tries** in the tradition of the original **Egothor** stemmer.
|
||||||
|
|
||||||
|
On the current JMH English comparison benchmark, Radixor with bundled `US_UK_PROFI`
|
||||||
|
reaches approximately **31 to 32 million tokens per second**, compared with about
|
||||||
|
**8 million tokens per second** for Snowball original Porter and about
|
||||||
|
**5 to 5.5 million tokens per second** for Snowball English (Porter2).
|
||||||
|
|
||||||
|
That means the current Radixor implementation is approximately:
|
||||||
|
|
||||||
|
- **4× faster** than Snowball original Porter
|
||||||
|
- **6× faster** than Snowball English (Porter2)
|
||||||
|
|
||||||
It is designed for production search and text-processing systems that need stemming which is:
|
It is designed for production search and text-processing systems that need stemming which is:
|
||||||
|
|
||||||
- fast at runtime
|
- fast at runtime
|
||||||
@@ -22,6 +32,7 @@ Radixor keeps the valuable core of the original Egothor idea, modernizes the imp
|
|||||||
- [Heritage](#heritage)
|
- [Heritage](#heritage)
|
||||||
- [What Radixor adds](#what-radixor-adds)
|
- [What Radixor adds](#what-radixor-adds)
|
||||||
- [Key features](#key-features)
|
- [Key features](#key-features)
|
||||||
|
- [Performance](#performance)
|
||||||
- [Documentation](#documentation)
|
- [Documentation](#documentation)
|
||||||
- [Project philosophy](#project-philosophy)
|
- [Project philosophy](#project-philosophy)
|
||||||
- [Historical note](#historical-note)
|
- [Historical note](#historical-note)
|
||||||
@@ -37,7 +48,7 @@ This gives you a stemmer that is:
|
|||||||
- compact enough for deployment-friendly binary artifacts
|
- compact enough for deployment-friendly binary artifacts
|
||||||
- suitable for both offline compilation and runtime loading
|
- suitable for both offline compilation and runtime loading
|
||||||
|
|
||||||
Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer.
|
Radixor is especially attractive when you want something more adaptable than simple suffix stripping, but much smaller and easier to operate than a full morphological analyzer. In the current English benchmark comparison against the Snowball Porter stemmer family, it also delivers a substantial throughput advantage.
|
||||||
|
|
||||||
## Heritage
|
## Heritage
|
||||||
|
|
||||||
@@ -95,6 +106,27 @@ Compared with the historical baseline, Radixor emphasizes:
|
|||||||
- Bundled language resources
|
- Bundled language resources
|
||||||
- Support for extending compiled stemmer tables
|
- Support for extending compiled stemmer tables
|
||||||
|
|
||||||
|
## Performance
|
||||||
|
|
||||||
|
Radixor includes a JMH benchmark suite for both its own algorithmic core and a
|
||||||
|
side-by-side comparison against the Snowball Porter stemmer family.
|
||||||
|
|
||||||
|
On the current English comparison workload, Radixor with bundled `US_UK_PROFI`
|
||||||
|
reaches approximately **31 to 32 million tokens per second**. Snowball original
|
||||||
|
Porter reaches approximately **8 million tokens per second**, and Snowball
|
||||||
|
English (Porter2) approximately **5 to 5.5 million tokens per second**.
|
||||||
|
|
||||||
|
That places Radixor at approximately **4× the throughput of Snowball original Porter**
|
||||||
|
and approximately **6× the throughput of Snowball English (Porter2)**
|
||||||
|
on the current benchmark workload.
|
||||||
|
|
||||||
|
This is a throughput comparison on the same deterministic token stream. It is
|
||||||
|
not a claim that the compared stemmers are linguistically equivalent or
|
||||||
|
interchangeable.
|
||||||
|
|
||||||
|
For benchmark scope, workload design, environment, commands, report locations,
|
||||||
|
and interpretation guidance, see [Benchmarking](docs/benchmarking.md).
|
||||||
|
|
||||||
## Documentation
|
## Documentation
|
||||||
|
|
||||||
The repository keeps the front page concise and places detailed documentation under `docs/`.
|
The repository keeps the front page concise and places detailed documentation under `docs/`.
|
||||||
@@ -122,6 +154,9 @@ Start here:
|
|||||||
- [Quality and Operations](docs/quality-and-operations.md)
|
- [Quality and Operations](docs/quality-and-operations.md)
|
||||||
Testing, persistence, deployment, and operational guidance.
|
Testing, persistence, deployment, and operational guidance.
|
||||||
|
|
||||||
|
- [Benchmarking](docs/benchmarking.md)
|
||||||
|
JMH benchmark design, Snowball comparison, execution, and interpretation.
|
||||||
|
|
||||||
## Project philosophy
|
## Project philosophy
|
||||||
|
|
||||||
Radixor does not preserve historical complexity for its own sake.
|
Radixor does not preserve historical complexity for its own sake.
|
||||||
|
|||||||
BIN
Radixor.png
Binary file not shown.
|
Size before: 1.5 MiB. Size after: 318 KiB.
@@ -125,7 +125,7 @@ jmh {
|
|||||||
|
|
||||||
tasks.named('jmh') {
|
tasks.named('jmh') {
|
||||||
group = 'verification'
|
group = 'verification'
|
||||||
description = 'Runs JMH benchmarks for the Radixor algorithmic core.'
|
description = 'Runs JMH benchmarks for the Radixor algorithmic core and Snowball comparison suite.'
|
||||||
}
|
}
|
||||||
|
|
||||||
javadoc {
|
javadoc {
|
||||||
@@ -154,6 +154,8 @@ javadoc {
|
|||||||
source = sourceSets.main.allJava
|
source = sourceSets.main.allJava
|
||||||
}
|
}
|
||||||
|
|
||||||
|
apply from: 'gradle/snowball-benchmarks.gradle'
|
||||||
|
|
||||||
gradle.taskGraph.whenReady { taskGraph ->
|
gradle.taskGraph.whenReady { taskGraph ->
|
||||||
def banner = """
|
def banner = """
|
||||||
\u001B[34m
|
\u001B[34m
|
||||||
|
|||||||
134
docs/benchmarking.md
Normal file
@@ -0,0 +1,134 @@
|
|||||||
|
# Benchmarking
|
||||||
|
|
||||||
|
> ← Back to [README.md](../README.md)
|
||||||
|
|
||||||
|
Radixor includes a JMH benchmark suite for both the internal algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family.
|
||||||
|
|
||||||
|
This document explains what is benchmarked, how to run it, and how to interpret the results responsibly.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
The benchmark suite currently covers two categories:
|
||||||
|
|
||||||
|
- Radixor core operations
|
||||||
|
- English stemmer comparison on the same token workload
|
||||||
|
|
||||||
|
The comparison benchmark processes the same deterministic English token stream through:
|
||||||
|
|
||||||
|
- Radixor with bundled `US_UK_PROFI`
|
||||||
|
- Snowball original Porter
|
||||||
|
- Snowball English, commonly referred to as Porter2
|
||||||
|
|
||||||
|
The purpose of the comparison is throughput measurement on identical input. It is not intended to prove linguistic equivalence between the compared stemmers.
|
||||||
|
|
||||||
|
## Current snapshot
|
||||||
|
|
||||||
|
A recent JMH run on JDK 21.0.10 with JMH 1.37, one thread, three warmup iterations, and five measurement iterations produced the following approximate throughput figures:
|
||||||
|
|
||||||
|
| Workload | Radixor `US_UK_PROFI` | Snowball Porter | Snowball English |
|
||||||
|
| --- | ---: | ---: | ---: |
|
||||||
|
| About 12,000 generated tokens | 30.99 M tokens/s | 8.21 M tokens/s | 5.46 M tokens/s |
|
||||||
|
| About 60,000 generated tokens | 32.25 M tokens/s | 8.02 M tokens/s | 5.11 M tokens/s |
|
||||||
|
|
||||||
|
On that workload, Radixor is approximately:
|
||||||
|
|
||||||
|
- 4 times faster than Snowball original Porter
|
||||||
|
- 6 times faster than Snowball English
|
||||||
|
|
||||||
|
These values are workload- and environment-dependent. Treat them as measured results for the documented benchmark setup, not as universal constants.
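
The rounded multipliers follow directly from the table. The following is a minimal sketch of that arithmetic using only the figures reported above; the class and method names are illustrative and are not part of Radixor.

```java
import java.util.Locale;

/** Illustrative helper, not part of Radixor: derives speedup ratios from the snapshot table. */
final class SpeedupRatios {

    private SpeedupRatios() {
    }

    /** Returns how many times larger the candidate throughput is than the baseline. */
    static double ratio(final double candidate, final double baseline) {
        if (baseline <= 0.0) {
            throw new IllegalArgumentException("baseline must be positive.");
        }
        return candidate / baseline;
    }

    public static void main(final String[] args) {
        // Million tokens per second, copied from the table:
        // { Radixor, Snowball Porter, Snowball English } for each workload size.
        final double[][] rows = { { 30.99, 8.21, 5.46 }, { 32.25, 8.02, 5.11 } };
        for (final double[] row : rows) {
            System.out.printf(Locale.ROOT, "vs Porter: %.2fx, vs English: %.2fx%n",
                    ratio(row[0], row[1]), ratio(row[0], row[2]));
        }
    }
}
```

Across both workload sizes the ratios fall at roughly 3.8 to 4.0 against Snowball Porter and 5.7 to 6.3 against Snowball English, which is where the rounded 4 and 6 come from.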
|
||||||
|
|
||||||
|
## Benchmark classes
|
||||||
|
|
||||||
|
The main benchmark classes are under `src/jmh/java/org/egothor/stemmer/benchmark`.
|
||||||
|
|
||||||
|
Relevant classes include:
|
||||||
|
|
||||||
|
- `FrequencyTrieLookupBenchmark`
|
||||||
|
- `FrequencyTrieCompilationBenchmark`
|
||||||
|
- `EnglishStemmerComparisonBenchmark`
|
||||||
|
|
||||||
|
The English comparison benchmark uses the bundled Radixor English resource and the official Snowball Java distribution integrated into the JMH source set.
|
||||||
|
|
||||||
|
## Workload design
|
||||||
|
|
||||||
|
The English comparison benchmark uses a deterministic generated corpus rather than an uncontrolled ad hoc text sample.
|
||||||
|
|
||||||
|
The workload intentionally mixes:
|
||||||
|
|
||||||
|
- simple inflections
|
||||||
|
- common derivational forms
|
||||||
|
- US and UK spelling families
|
||||||
|
- lexical forms appropriate for `US_UK_PROFI`
|
||||||
|
|
||||||
|
This design keeps runs reproducible across environments and avoids accidental drift caused by changing external corpora.
|
||||||
|
|
||||||
|
## Running benchmarks
|
||||||
|
|
||||||
|
Run the full benchmark suite:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./gradlew jmh
|
||||||
|
```
|
||||||
|
|
||||||
|
Run only the English comparison benchmark:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./gradlew jmh -Pjmh.includes=EnglishStemmerComparisonBenchmark
|
||||||
|
```
|
||||||
|
|
||||||
|
## Generated reports
|
||||||
|
|
||||||
|
JMH reports are written to:
|
||||||
|
|
||||||
|
- `build/reports/jmh/jmh-results.txt`
|
||||||
|
- `build/reports/jmh/jmh-results.csv`
|
||||||
|
|
||||||
|
The text report is convenient for human review. The CSV report is more useful for CI archiving, historical tracking, and external processing.
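
`EnglishStemmerComparisonBenchmark` is annotated with `Mode.AverageTime` and nanoseconds as the output unit, and each invocation walks the whole token array. Assuming the Gradle `jmh` configuration does not override that mode, the reported score is time per invocation rather than tokens per second. The sketch below shows one way to convert it; the helper class is hypothetical, the score in `main` is a made-up value chosen only for illustration, and the count of 12 forms per family reflects the unconditional `tokens.add` calls in `EnglishComparisonCorpus`.

```java
/** Illustrative helper, not part of Radixor: converts average time per invocation to token throughput. */
final class ThroughputConversion {

    /** Unconditional surface forms added per family by EnglishComparisonCorpus.createTokens. */
    static final int FORMS_PER_FAMILY = 12;

    private ThroughputConversion() {
    }

    /**
     * Converts an average-time score into tokens per second.
     *
     * @param tokensPerInvocation number of tokens processed by one benchmark invocation
     * @param nanosPerInvocation  average nanoseconds per invocation
     * @return tokens processed per second
     */
    static double tokensPerSecond(final long tokensPerInvocation, final double nanosPerInvocation) {
        if (tokensPerInvocation < 1 || nanosPerInvocation <= 0.0) {
            throw new IllegalArgumentException("Both arguments must be positive.");
        }
        return tokensPerInvocation * 1_000_000_000.0 / nanosPerInvocation;
    }

    public static void main(final String[] args) {
        final long tokens = 1000L * FORMS_PER_FAMILY;
        // Hypothetical score, not a measured result.
        final double hypotheticalNanosPerInvocation = 400_000.0;
        System.out.println(tokensPerSecond(tokens, hypotheticalNanosPerInvocation));
    }
}
```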
|
||||||
|
|
||||||
|
## Interpreting results
|
||||||
|
|
||||||
|
Benchmark numbers should be read with care.
|
||||||
|
|
||||||
|
Important factors include:
|
||||||
|
|
||||||
|
- CPU model and frequency behavior
|
||||||
|
- thermal throttling
|
||||||
|
- JVM vendor and version
|
||||||
|
- system background load
|
||||||
|
- operating-system scheduling noise
|
||||||
|
- benchmark parameter changes
|
||||||
|
|
||||||
|
For meaningful comparison, keep these stable:
|
||||||
|
|
||||||
|
- hardware or VM class
|
||||||
|
- JDK version
|
||||||
|
- benchmark parameters
|
||||||
|
- thread count
|
||||||
|
- benchmark source revision
|
||||||
|
|
||||||
|
If a regression is suspected, repeat the run and compare against the previous CSV output rather than relying on a single measurement.
|
||||||
|
|
||||||
|
## Regression tracking
|
||||||
|
|
||||||
|
The recommended regression workflow is:
|
||||||
|
|
||||||
|
1. archive `jmh-results.csv`
|
||||||
|
2. compare the same benchmark names across runs
|
||||||
|
3. compare only like-for-like environments
|
||||||
|
4. investigate sustained regressions rather than one-off noise
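
The workflow above can be sketched as a small comparison over two sets of archived scores. This is a sketch under stated assumptions, not project tooling: the class is hypothetical, parsing of `jmh-results.csv` is left out, scores are assumed to be "lower is better" (time per operation), and the numbers in `main` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative helper, not part of Radixor: flags benchmarks whose score worsened between two runs. */
final class RegressionCheck {

    private RegressionCheck() {
    }

    /**
     * Returns the names of benchmarks whose score grew by more than the tolerated fraction.
     * Scores are treated as time per operation, so a larger value is worse.
     */
    static List<String> regressions(final Map<String, Double> baseline, final Map<String, Double> current,
            final double tolerance) {
        final List<String> flagged = new ArrayList<>();
        // TreeMap gives a stable, name-sorted iteration order.
        for (final Map.Entry<String, Double> entry : new TreeMap<>(baseline).entrySet()) {
            final Double now = current.get(entry.getKey());
            if (now == null || entry.getValue() <= 0.0) {
                continue; // compare only benchmarks present in both runs with a usable baseline
            }
            final double relativeChange = (now - entry.getValue()) / entry.getValue();
            if (relativeChange > tolerance) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(final String[] args) {
        // Hypothetical scores, not measured results.
        final Map<String, Double> baseline = Map.of("radixorUsUkProfiPreferredStem", 100.0,
                "snowballOriginalPorter", 400.0);
        final Map<String, Double> current = Map.of("radixorUsUkProfiPreferredStem", 130.0,
                "snowballOriginalPorter", 404.0);
        System.out.println(regressions(baseline, current, 0.10));
    }
}
```

A tolerance such as 10% is only a starting point; it should be wider than the run-to-run noise observed on the machine that produces the archived reports.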
|
||||||
|
|
||||||
|
For public reporting, the README should keep only the condensed benchmark summary, while detailed benchmark methodology and interpretation should remain in this document.
|
||||||
|
|
||||||
|
## Notes on comparison fairness
|
||||||
|
|
||||||
|
Radixor, Snowball Porter, and Snowball English are not the same kind of stemmer.
|
||||||
|
|
||||||
|
Radixor uses a compiled patch-command trie driven by dictionary data. Snowball Porter and Snowball English are rule-based English stemmers.
|
||||||
|
|
||||||
|
Because of that, the comparison should be understood as:
|
||||||
|
|
||||||
|
- equal input workload
|
||||||
|
- different stemming strategies
|
||||||
|
- measured throughput, not semantic identity
|
||||||
|
|
||||||
|
That distinction matters whenever performance claims are discussed in documentation or release notes.
|
||||||
49
gradle/snowball-benchmarks.gradle
Normal file
@@ -0,0 +1,49 @@
|
|||||||
|
def snowballVersion = '3.0.1'
|
||||||
|
def snowballArchiveName = "libstemmer_java-${snowballVersion}.tar.gz"
|
||||||
|
def snowballDownloadUrl = "https://snowballstem.org/dist/${snowballArchiveName}"
|
||||||
|
def snowballDownloadFile = layout.buildDirectory.file("third-party/snowball/${snowballArchiveName}")
|
||||||
|
def snowballExtractDirectory = layout.buildDirectory.dir('third-party/snowball/source')
|
||||||
|
def snowballJavaSourceDirectory = layout.buildDirectory.dir(
|
||||||
|
"third-party/snowball/source/libstemmer_java-${snowballVersion}/java")
|
||||||
|
|
||||||
|
tasks.register('downloadSnowballJava') {
|
||||||
|
group = 'build setup'
|
||||||
|
description = 'Downloads the official Snowball Java source distribution for benchmark-only use.'
|
||||||
|
|
||||||
|
outputs.file(snowballDownloadFile)
|
||||||
|
|
||||||
|
doLast {
|
||||||
|
File targetFile = snowballDownloadFile.get().asFile
|
||||||
|
targetFile.parentFile.mkdirs()
|
||||||
|
|
||||||
|
if (!targetFile.exists()) {
|
||||||
|
new URL(snowballDownloadUrl).withInputStream { inputStream ->
|
||||||
|
targetFile.withOutputStream { outputStream ->
|
||||||
|
outputStream << inputStream
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
tasks.register('extractSnowballJava', Copy) {
|
||||||
|
group = 'build setup'
|
||||||
|
description = 'Extracts the official Snowball Java source distribution.'
|
||||||
|
|
||||||
|
dependsOn(tasks.named('downloadSnowballJava'))
|
||||||
|
|
||||||
|
from(tarTree(resources.gzip(snowballDownloadFile)))
|
||||||
|
into(snowballExtractDirectory)
|
||||||
|
}
|
||||||
|
|
||||||
|
sourceSets {
|
||||||
|
jmh {
|
||||||
|
java {
|
||||||
|
srcDir(snowballJavaSourceDirectory)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
tasks.named('compileJmhJava') {
|
||||||
|
dependsOn(tasks.named('extractSnowballJava'))
|
||||||
|
}
|
||||||
@@ -0,0 +1,110 @@
|
|||||||
|
package org.egothor.stemmer.benchmark;
|
||||||
|
|
||||||
|
import java.util.ArrayList;
|
||||||
|
import java.util.List;
|
||||||
|
import java.util.Locale;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Builds a deterministic English token corpus for side-by-side stemming
|
||||||
|
* benchmarks.
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* The generated corpus mixes:
|
||||||
|
* </p>
|
||||||
|
* <ul>
|
||||||
|
* <li>simple inflections</li>
|
||||||
|
* <li>common derivational forms</li>
|
||||||
|
* <li>US/UK spelling families</li>
|
||||||
|
* <li>forms that are suitable for comparison against the bundled
|
||||||
|
* {@code US_UK_PROFI} Radixor dictionary</li>
|
||||||
|
* </ul>
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* The goal is not to simulate natural language frequency distribution exactly,
|
||||||
|
* but to provide a stable and reproducible comparison workload for benchmark
|
||||||
|
* runs and regression tracking.
|
||||||
|
* </p>
|
||||||
|
*/
|
||||||
|
final class EnglishComparisonCorpus {
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Canonical lexical bases used to generate the token workload.
|
||||||
|
*/
|
||||||
|
private static final String[] BASES = { "analyze", "analyse", "color", "colour", "center", "centre", "organize",
|
||||||
|
"organise", "optimize", "optimise", "characterize", "characterise", "connect", "construct", "compute",
|
||||||
|
"design", "develop", "engineer", "govern", "improve", "index", "inform", "manage", "model", "observe",
|
||||||
|
"operate", "perform", "predict", "prepare", "process", "project", "protect", "publish", "query", "reduce",
|
||||||
|
"refresh", "render", "resolve", "return", "search", "select", "signal", "store", "structure", "support",
|
||||||
|
"transform", "update", "validate", "value" };
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Utility class.
|
||||||
|
*/
|
||||||
|
private EnglishComparisonCorpus() {
|
||||||
|
throw new AssertionError("No instances.");
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Creates a deterministic token corpus for English stemming comparison.
|
||||||
|
*
|
||||||
|
* @param familyCount number of generated lexical families
|
||||||
|
* @return token array in stable order
|
||||||
|
*/
|
||||||
|
static String[] createTokens(final int familyCount) {
|
||||||
|
if (familyCount < 1) {
|
||||||
|
throw new IllegalArgumentException("familyCount must be at least 1.");
|
||||||
|
}
|
||||||
|
|
||||||
|
final List<String> tokens = new ArrayList<>(familyCount * 14);
|
||||||
|
|
||||||
|
for (int index = 0; index < familyCount; index++) {
|
||||||
|
final String base = createBase(index);
|
||||||
|
|
||||||
|
tokens.add(base);
|
||||||
|
tokens.add(base + "s");
|
||||||
|
tokens.add(base + "ed");
|
||||||
|
tokens.add(base + "ing");
|
||||||
|
tokens.add(base + "er");
|
||||||
|
tokens.add(base + "ers");
|
||||||
|
tokens.add(base + "ly");
|
||||||
|
tokens.add(base + "ness");
|
||||||
|
tokens.add(base + "ment");
|
||||||
|
tokens.add(base + "ments");
|
||||||
|
tokens.add(base + "able");
|
||||||
|
tokens.add(base + "ability");
|
||||||
|
|
||||||
|
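// Note: base already ends with the base-36 discriminator from createBase(int), so this
// check and the "ise" check below match only when that discriminator completes the ending.
if (base.endsWith("ize")) {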
|
||||||
|
tokens.add(base.substring(0, base.length() - 3) + "isation");
|
||||||
|
tokens.add(base.substring(0, base.length() - 3) + "ised");
|
||||||
|
}
|
||||||
|
|
||||||
|
if (base.endsWith("ise")) {
|
||||||
|
tokens.add(base.substring(0, base.length() - 3) + "ization");
|
||||||
|
tokens.add(base.substring(0, base.length() - 3) + "ized");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return tokens.toArray(String[]::new);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Creates one deterministic base token.
|
||||||
|
*
|
||||||
|
* @param index base ordinal
|
||||||
|
* @return generated lexical base
|
||||||
|
*/
|
||||||
|
private static String createBase(final int index) {
|
||||||
|
return (BASES[index % BASES.length] + suffix(index)).toLowerCase(Locale.ROOT);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Creates a compact discriminator suffix so that large corpora remain unique
|
||||||
|
* while retaining stable lexical families.
|
||||||
|
*
|
||||||
|
* @param value ordinal value
|
||||||
|
* @return compact discriminator
|
||||||
|
*/
|
||||||
|
private static String suffix(final int value) {
|
||||||
|
return Integer.toString(value, Character.MAX_RADIX);
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,168 @@
|
|||||||
|
package org.egothor.stemmer.benchmark;
|
||||||
|
|
||||||
|
import java.io.IOException;
|
||||||
|
import java.util.concurrent.TimeUnit;
|
||||||
|
|
||||||
|
import org.egothor.stemmer.FrequencyTrie;
|
||||||
|
import org.egothor.stemmer.PatchCommandEncoder;
|
||||||
|
import org.egothor.stemmer.ReductionMode;
|
||||||
|
import org.egothor.stemmer.StemmerPatchTrieLoader;
|
||||||
|
import org.openjdk.jmh.annotations.Benchmark;
|
||||||
|
import org.openjdk.jmh.annotations.BenchmarkMode;
|
||||||
|
import org.openjdk.jmh.annotations.Level;
|
||||||
|
import org.openjdk.jmh.annotations.Measurement;
|
||||||
|
import org.openjdk.jmh.annotations.Mode;
|
||||||
|
import org.openjdk.jmh.annotations.OutputTimeUnit;
|
||||||
|
import org.openjdk.jmh.annotations.Param;
|
||||||
|
import org.openjdk.jmh.annotations.Scope;
|
||||||
|
import org.openjdk.jmh.annotations.Setup;
|
||||||
|
import org.openjdk.jmh.annotations.State;
|
||||||
|
import org.openjdk.jmh.annotations.Warmup;
|
||||||
|
import org.openjdk.jmh.infra.Blackhole;
|
||||||
|
import org.tartarus.snowball.ext.englishStemmer;
|
||||||
|
import org.tartarus.snowball.ext.porterStemmer;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Compares English stemming throughput across Radixor and Snowball stemmers.
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* The benchmark processes the same deterministic token array with:
|
||||||
|
* </p>
|
||||||
|
* <ul>
|
||||||
|
* <li>Radixor using bundled
|
||||||
|
* {@link StemmerPatchTrieLoader.Language#US_UK_PROFI}</li>
|
||||||
|
* <li>Snowball original Porter stemmer</li>
|
||||||
|
* <li>Snowball English stemmer, commonly referred to as Porter2</li>
|
||||||
|
* </ul>
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* This benchmark compares throughput on a shared workload. It does not imply
|
||||||
|
* that the algorithms are linguistically equivalent.
|
||||||
|
* </p>
|
||||||
|
*/
|
||||||
|
@BenchmarkMode(Mode.AverageTime)
|
||||||
|
@OutputTimeUnit(TimeUnit.NANOSECONDS)
|
||||||
|
@Warmup(iterations = 3, time = 1)
|
||||||
|
@Measurement(iterations = 5, time = 1)
|
||||||
|
public class EnglishStemmerComparisonBenchmark {
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Shared benchmark data.
|
||||||
|
*/
|
||||||
|
@State(Scope.Benchmark)
|
||||||
|
public static class SharedState {
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Number of generated lexical families.
|
||||||
|
*/
|
||||||
|
@Param({ "1000", "5000" })
|
||||||
|
public int familyCount;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Token workload processed by all compared stemmers.
|
||||||
|
*/
|
||||||
|
private String[] tokens;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Radixor trie loaded from the bundled professional English dictionary.
|
||||||
|
*/
|
||||||
|
private FrequencyTrie<String> radixorTrie;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Initializes the shared benchmark state.
|
||||||
|
*
|
||||||
|
* @throws IOException if the bundled Radixor dictionary cannot be loaded
|
||||||
|
*/
|
||||||
|
@Setup(Level.Trial)
|
||||||
|
public void setUp() throws IOException {
|
||||||
|
this.tokens = EnglishComparisonCorpus.createTokens(this.familyCount);
|
||||||
|
this.radixorTrie = StemmerPatchTrieLoader.load(StemmerPatchTrieLoader.Language.US_UK_PROFI, true,
|
||||||
|
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Per-thread reusable Snowball stemmers.
|
||||||
|
*/
|
||||||
|
@State(Scope.Thread)
|
||||||
|
public static class SnowballState {
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Adapter for the original Porter stemmer.
|
||||||
|
*/
|
||||||
|
private SnowballStemmerAdapter porterStemmer;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Adapter for the Snowball English stemmer.
|
||||||
|
*/
|
||||||
|
private SnowballStemmerAdapter englishStemmer;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Initializes reusable Snowball stemmers for the executing thread.
|
||||||
|
*/
|
||||||
|
@Setup(Level.Trial)
|
||||||
|
public void setUp() {
|
||||||
|
this.porterStemmer = new SnowballStemmerAdapter(porterStemmer::new);
|
||||||
|
this.englishStemmer = new SnowballStemmerAdapter(englishStemmer::new);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Measures Radixor preferred-result stemming throughput.
|
||||||
|
*
|
||||||
|
* @param sharedState shared benchmark data
|
||||||
|
* @param blackhole sink preventing dead-code elimination
|
||||||
|
*/
|
||||||
|
@Benchmark
|
||||||
|
public void radixorUsUkProfiPreferredStem(final SharedState sharedState, final Blackhole blackhole) {
|
||||||
|
final String[] tokens = sharedState.tokens;
|
||||||
|
final FrequencyTrie<String> trie = sharedState.radixorTrie;
|
||||||
|
|
||||||
|
for (String token : tokens) {
|
||||||
|
final String patch = trie.get(token);
|
||||||
|
final String stem = patch == null ? token : PatchCommandEncoder.apply(token, patch);
|
||||||
|
blackhole.consume(stem);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Measures Snowball original Porter stemming throughput.
|
||||||
|
*
|
||||||
|
* @param sharedState shared benchmark data
|
||||||
|
* @param snowballState reusable Snowball stemmers
|
||||||
|
* @param blackhole sink preventing dead-code elimination
|
||||||
|
*/
|
||||||
|
@Benchmark
|
||||||
|
public void snowballOriginalPorter(final SharedState sharedState, final SnowballState snowballState,
|
||||||
|
final Blackhole blackhole) {
|
||||||
|
final String[] tokens = sharedState.tokens;
|
||||||
|
final SnowballStemmerAdapter stemmer = snowballState.porterStemmer;
|
||||||
|
|
||||||
|
for (String token : tokens) {
|
||||||
|
blackhole.consume(stemmer.stem(token));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Measures Snowball English stemming throughput.
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* Snowball English is the newer English stemmer commonly referred to as
|
||||||
|
* Porter2.
|
||||||
|
* </p>
|
||||||
|
*
|
||||||
|
* @param sharedState shared benchmark data
|
||||||
|
* @param snowballState reusable Snowball stemmers
|
||||||
|
* @param blackhole sink preventing dead-code elimination
|
||||||
|
*/
|
||||||
|
@Benchmark
|
||||||
|
public void snowballEnglishPorter2(final SharedState sharedState, final SnowballState snowballState,
|
||||||
|
final Blackhole blackhole) {
|
||||||
|
final String[] tokens = sharedState.tokens;
|
||||||
|
final SnowballStemmerAdapter stemmer = snowballState.englishStemmer;
|
||||||
|
|
||||||
|
for (String token : tokens) {
|
||||||
|
blackhole.consume(stemmer.stem(token));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,57 @@
|
|||||||
|
package org.egothor.stemmer.benchmark;
|
||||||
|
|
||||||
|
import java.util.Objects;
|
||||||
|
|
||||||
|
import org.tartarus.snowball.SnowballStemmer;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Small adapter around a Snowball stemmer instance used by benchmarks.
|
||||||
|
*
|
||||||
|
* <p>
|
||||||
|
* The adapter keeps the benchmark code focused on the actual workload while
|
||||||
|
* keeping a clean separation between benchmark orchestration and
|
||||||
|
* third-party stemming API calls.
|
||||||
|
* </p>
|
||||||
|
*/
|
||||||
|
final class SnowballStemmerAdapter {
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Factory of Snowball stemmer instances.
|
||||||
|
*/
|
||||||
|
@FunctionalInterface
|
||||||
|
interface Factory {
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Creates a new Snowball stemmer instance.
|
||||||
|
*
|
||||||
|
* @return new Snowball stemmer
|
||||||
|
*/
|
||||||
|
SnowballStemmer create();
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Reusable Snowball stemmer instance.
|
||||||
|
*/
|
||||||
|
private final SnowballStemmer stemmer;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Creates a new adapter.
|
||||||
|
*
|
||||||
|
* @param factory factory creating the concrete Snowball stemmer
|
||||||
|
*/
|
||||||
|
SnowballStemmerAdapter(final Factory factory) {
|
||||||
|
this.stemmer = Objects.requireNonNull(factory, "factory").create();
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Applies stemming to the supplied token.
|
||||||
|
*
|
||||||
|
* @param token input token
|
||||||
|
* @return produced stem
|
||||||
|
*/
|
||||||
|
String stem(final String token) {
|
||||||
|
this.stemmer.setCurrent(token);
|
||||||
|
this.stemmer.stem();
|
||||||
|
return this.stemmer.getCurrent();
|
||||||
|
}
|
||||||
|
}
|
||||||