# Benchmarking
Radixor includes a JMH benchmark suite for both the internal algorithmic core and a side-by-side English comparison against the Snowball Porter stemmer family.
This document explains what is benchmarked, how to run it, and how to interpret the results responsibly.
## Scope
The benchmark suite currently covers two categories:
- Radixor core operations
- English stemmer comparison on the same token workload

The comparison benchmark processes the same deterministic English token stream through:
- Radixor with bundled `US_UK_PROFI`
- Snowball original Porter
- Snowball English, commonly referred to as Porter2

The purpose of the comparison is throughput measurement on identical input. It is not intended to prove linguistic equivalence between the compared stemmers.
## Current snapshot
A recent JMH run on JDK 21.0.10 with JMH 1.37, using one thread, three warmup iterations, and five measurement iterations, produced the following approximate throughput figures:
| Workload | Radixor `US_UK_PROFI` | Snowball Porter | Snowball English |
| --- | ---: | ---: | ---: |
| About 12,000 generated tokens | 30.99 M tokens/s | 8.21 M tokens/s | 5.46 M tokens/s |
| About 60,000 generated tokens | 32.25 M tokens/s | 8.02 M tokens/s | 5.11 M tokens/s |

On these two workloads, Radixor is approximately:
- 3.8 to 4.0 times faster than Snowball original Porter
- 5.7 to 6.3 times faster than Snowball English

These values are workload- and environment-dependent. Treat them as measured results for the documented benchmark setup, not as universal constants.
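
The rounded multipliers can be rechecked directly from the table. The following is a minimal sketch of that arithmetic; the class name `SpeedupRatios` and its method are illustrative only and are not part of Radixor.

```java
import java.util.Locale;

public class SpeedupRatios {

    /** Returns how many times higher the candidate throughput is than the baseline. */
    public static double speedup(double candidate, double baseline) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        // Throughput in million tokens per second, copied from the snapshot table.
        String[] workload = {"~12,000 tokens", "~60,000 tokens"};
        double[] radixor = {30.99, 32.25};
        double[] porter = {8.21, 8.02};
        double[] english = {5.46, 5.11};

        for (int i = 0; i < workload.length; i++) {
            System.out.printf(Locale.ROOT,
                    "%s: %.2fx vs Porter, %.2fx vs English%n",
                    workload[i],
                    speedup(radixor[i], porter[i]),
                    speedup(radixor[i], english[i]));
        }
    }
}
```

For the table values this yields about 3.77 and 5.68 on the smaller workload, and about 4.02 and 6.31 on the larger one.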
## Benchmark classes
The main benchmark classes are under `src/jmh/java/org/egothor/stemmer/benchmark`.
Relevant classes include:
- `FrequencyTrieLookupBenchmark`
- `FrequencyTrieCompilationBenchmark`
- `EnglishStemmerComparisonBenchmark`

The English comparison benchmark uses the bundled Radixor English resource and the official Snowball Java distribution integrated into the JMH source set.
## Workload design
The English comparison benchmark uses a deterministic generated corpus rather than an uncontrolled ad hoc text sample.
The workload intentionally mixes:
- simple inflections
- common derivational forms
- US and UK spelling families
- lexical forms appropriate for `US_UK_PROFI`

This design keeps runs reproducible across environments and avoids accidental drift caused by changing external corpora.
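
To illustrate why a generated corpus stays reproducible, here is a minimal sketch that derives tokens from a fixed seed. It is not the generator used by `EnglishStemmerComparisonBenchmark`; the class name, word lists, and seed are hypothetical and chosen only to show the principle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DeterministicCorpus {

    // Hypothetical base words covering US and UK spelling families.
    private static final String[] BASES = {"color", "colour", "favor", "favour", "connect", "walk"};

    // Hypothetical simple inflectional suffixes.
    private static final String[] SUFFIXES = {"", "s", "ed", "ing"};

    /** Generates the same token list for the same size and seed on every run. */
    public static List<String> generate(int size, long seed) {
        Random random = new Random(seed); // fixed seed, so the sequence is repeatable
        List<String> tokens = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            String base = BASES[random.nextInt(BASES.length)];
            String suffix = SUFFIXES[random.nextInt(SUFFIXES.length)];
            tokens.add(base + suffix);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> first = generate(12_000, 42L);
        List<String> second = generate(12_000, 42L);
        // Two independent generations with the same seed are identical.
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

Because the token stream depends only on the size and the seed, every stemmer in a comparison can be fed exactly the same input without shipping or downloading an external corpus.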
## Running benchmarks
Run the full benchmark suite:
```bash
./gradlew jmh
```
Run only the English comparison benchmark:
```bash
./gradlew jmh -Pjmh.includes=EnglishStemmerComparisonBenchmark
```
## Generated reports
JMH reports are written to:
- `build/reports/jmh/jmh-results.txt`
- `build/reports/jmh/jmh-results.csv`

The text report is convenient for human review. The CSV report is more useful for CI archiving, historical tracking, and external processing.
## Interpreting results
Benchmark numbers should be read with care.
Important factors include:
- CPU model and frequency behavior
- thermal throttling
- JVM vendor and version
- system background load
- operating-system scheduling noise
- benchmark parameter changes

For meaningful comparison, keep these stable:
- hardware or VM class
- JDK version
- benchmark parameters
- thread count
- benchmark source revision

If a regression is suspected, repeat the run and compare against the previous CSV output rather than relying on a single measurement.
## Regression tracking
The recommended regression workflow is:
1. archive `jmh-results.csv`
2. compare the same benchmark names across runs
3. compare only like-for-like environments
4. investigate sustained regressions rather than one-off noise

For public reporting, the README should keep only the condensed benchmark summary, while detailed benchmark methodology and interpretation should remain in this document.
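
The comparison step of this workflow can be sketched with a small helper that matches rows of two archived CSV files by benchmark name and parameters. This is an illustration, not a tool shipped with Radixor: the class name `JmhCsvCompare` is hypothetical, and the code assumes the CSV has `Benchmark` and `Score` columns, optional columns whose header starts with `Param:`, and no commas embedded inside cells.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class JmhCsvCompare {

    /** Splits a CSV line on commas and strips surrounding quotes (assumes no embedded commas). */
    static List<String> cells(String line) {
        List<String> out = new ArrayList<>();
        for (String cell : line.split(",", -1)) {
            out.add(cell.trim().replaceAll("^\"|\"$", ""));
        }
        return out;
    }

    /** Maps "benchmark name [param values]" to the reported score. */
    public static Map<String, Double> scores(List<String> lines) {
        List<String> header = cells(lines.get(0));
        int nameCol = header.indexOf("Benchmark");
        int scoreCol = header.indexOf("Score");
        Map<String, Double> result = new LinkedHashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.isBlank()) {
                continue;
            }
            List<String> row = cells(line);
            StringBuilder key = new StringBuilder(row.get(nameCol));
            for (int i = 0; i < header.size(); i++) {
                if (header.get(i).startsWith("Param:")) {
                    key.append(" [").append(row.get(i)).append(']');
                }
            }
            result.put(key.toString(), Double.parseDouble(row.get(scoreCol)));
        }
        return result;
    }

    /** Relative change of the current score against the baseline, in percent. */
    public static double changePercent(double baseline, double current) {
        return (current - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: archived baseline CSV, args[1]: CSV from the current run.
        Map<String, Double> baseline = scores(Files.readAllLines(Path.of(args[0])));
        Map<String, Double> current = scores(Files.readAllLines(Path.of(args[1])));
        for (Map.Entry<String, Double> entry : current.entrySet()) {
            Double before = baseline.get(entry.getKey());
            if (before == null) {
                continue; // compare only benchmarks present in both runs
            }
            System.out.printf(Locale.ROOT, "%s: %+.1f%%%n",
                    entry.getKey(), changePercent(before, entry.getValue()));
        }
    }
}
```

For throughput results a negative percentage indicates a slowdown; for time-based modes the sign reads the other way. A single negative value is not evidence of a regression on its own, so the output is best compared across several like-for-like runs.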
## Notes on comparison fairness
Radixor, Snowball Porter, and Snowball English are not the same kind of stemmer.
Radixor uses a compiled patch-command trie driven by dictionary data. Snowball Porter and Snowball English are rule-based English stemmers.
Because of that, the comparison should be understood as:
- equal input workload
- different stemming strategies
- measured throughput, not semantic identity

That distinction matters whenever performance claims are discussed in documentation or release notes.