feat: implement dense-child optimized trie lookup and enterprise test/CI profile hardening

2026-05-16 03:24:07 +02:00
parent 50c3ab3432
commit dadab5514e
44 changed files with 2052 additions and 294 deletions

@@ -0,0 +1,193 @@
# Lookup Edge Optimization
Compiled trie nodes (`CompiledNode`) use three lookup strategies when resolving child edges:
1. dense array direct lookup,
2. linear scan for very small child counts,
3. binary search over sorted edge labels.
This page explains the dense path, what `maxExpandedIndex` controls, and how to tune it.
## Runtime model of one node
For a node with sorted edge labels `char[] edges`, the implementation can materialize an
index-aligned dense table when labels occupy a small compact code-point interval:
```text
span = maxEdge - minEdge
use dense table iff (span <= maxExpandedIndex) and (maxExpandedIndex > 0)
```
When dense lookup is used, lookup is constant-time indexing:
```text
denseIndex = requestedEdge - minEdge
if denseIndex < 0 or denseIndex > span: return null   // outside the interval
return denseChildren[denseIndex]                      // may itself be null for a gap
```
When dense lookup is not active (interval is too wide or the configured
`maxExpandedIndex` is `0`), `CompiledNode` still chooses between two fallback
strategies:
- **linear scan** for very small child counts (`4` or fewer children),
- **binary search** for larger child counts.
This means the fallback method is selected by child count, not by “distance” alone.
`linear scan` is therefore used when there are only a few edges even if those edges are
spread across very distant code points.
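The three strategies described above can be sketched in a few lines. This is a minimal illustration, not the actual `CompiledNode` source: the class name `ChildLookupSketch`, the `String` child type, and the `strategy()` helper are hypothetical, and the cut-off of `4` is taken from the text.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the real CompiledNode implementation. */
final class ChildLookupSketch {

    /** Assumed cut-off from the text: 4 or fewer children use a linear scan. */
    static final int LINEAR_SCAN_LIMIT = 4;

    private final char[] edges;      // sorted edge labels
    private final String[] children; // children[i] belongs to edges[i]
    private final String[] dense;    // index-aligned table, or null when not materialized
    private final char minEdge;

    ChildLookupSketch(final char[] sortedEdges, final String[] childValues, final int maxExpandedIndex) {
        this.edges = sortedEdges.clone();
        this.children = childValues.clone();
        this.minEdge = edges.length == 0 ? 0 : edges[0];
        final int span = edges.length == 0 ? 0 : edges[edges.length - 1] - minEdge;
        if (edges.length > 0 && maxExpandedIndex > 0 && span <= maxExpandedIndex) {
            // Dense path: one slot per code point in [minEdge, maxEdge].
            this.dense = new String[span + 1];
            for (int i = 0; i < edges.length; i++) {
                dense[edges[i] - minEdge] = children[i];
            }
        } else {
            this.dense = null;
        }
    }

    /** Reports which strategy find(...) will take for this node. */
    String strategy() {
        if (dense != null) {
            return "dense";
        }
        return edges.length <= LINEAR_SCAN_LIMIT ? "linear" : "binary";
    }

    /** Returns the child for the given edge label, or null when there is none. */
    String find(final char edge) {
        if (dense != null) {
            final int index = edge - minEdge;
            return index >= 0 && index < dense.length ? dense[index] : null;
        }
        if (edges.length <= LINEAR_SCAN_LIMIT) {
            for (int i = 0; i < edges.length; i++) {
                if (edges[i] == edge) {
                    return children[i];
                }
            }
            return null;
        }
        final int position = Arrays.binarySearch(edges, edge);
        return position >= 0 ? children[position] : null;
    }

    public static void main(final String[] arguments) {
        final ChildLookupSketch node = new ChildLookupSketch(
                new char[] {'a', 'c', 'e'}, new String[] {"A", "C", "E"}, 512);
        System.out.println(node.strategy() + " " + node.find('c')); // dense C
    }
}
```

All three paths return the same answer for the same node; only the cost of reaching it differs.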
### Example: few edges, wide Unicode span
```text
edges = ['a', '中', '你']
edge count = 3
minEdge = 'a' (U+0061)
maxEdge = '你' (U+4F60)
span = 20223 (0x4F60 - 0x0061)
```
- If `maxExpandedIndex = 512`, dense indexing is not used because `span > maxExpandedIndex`.
- Because `edge count = 3` (<= 4), lookup falls back to a tiny linear scan of the
three labels.
- This is exactly the case where the small-child-count rule pays off: the interval is far too wide for a dense table, yet lookup stays cheap.
This is useful for non-Latin scripts as well: what matters is interval width in Unicode
code points, not script name. A compact Arabic-range block can still benefit from dense
lookups when keys stay in a tight code-point interval.
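The span arithmetic for both cases can be checked directly. The sketch below is illustrative only: `SpanExample` and its helper names are hypothetical, and the Arabic labels (alef, beh, yeh) are simply one example of a tight interval.

```java
/** Hypothetical helper that reproduces the span arithmetic from the text. */
final class SpanExample {

    /** span = maxEdge - minEdge, for labels already sorted ascending. */
    static int span(final char[] sortedEdges) {
        return sortedEdges[sortedEdges.length - 1] - sortedEdges[0];
    }

    /** The dense-table condition quoted in the text. */
    static boolean qualifiesForDense(final char[] sortedEdges, final int maxExpandedIndex) {
        return maxExpandedIndex > 0 && span(sortedEdges) <= maxExpandedIndex;
    }

    public static void main(final String[] arguments) {
        final char[] mixed = {'a', '\u4e2d', '\u4f60'};       // 'a', U+4E2D, U+4F60
        final char[] arabic = {'\u0627', '\u0628', '\u064a'}; // alef, beh, yeh
        System.out.println(span(mixed) + " " + qualifiesForDense(mixed, 512));   // 20223 false
        System.out.println(span(arabic) + " " + qualifiesForDense(arabic, 512)); // 35 true
    }
}
```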
## Why this is configurable
`maxExpandedIndex` is purely a speed/memory trade-off:
- higher value:
  - more intervals qualify for dense tables,
  - more child lookups resolve by constant-time indexing,
  - more memory is spent on dense tables in qualifying nodes.
- lower value (or `0`):
  - fewer dense tables are allocated,
  - fewer lookups take the constant-time path,
  - less memory is used when materializing nodes.
The value never changes lookup semantics. It only changes the in-memory structure shape.
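To get a feel for the memory side of the trade-off, the per-node cost can be estimated from the edge labels alone. This sketch assumes a dense table holds one reference slot per code point in the interval (`span + 1` slots), which follows from the index-aligned layout described above but is not confirmed against the real implementation. `DenseTableCost` and its methods are hypothetical names.

```java
/** Hypothetical estimate of dense-table size for one node. */
final class DenseTableCost {

    /** Slots a dense table would need, or 0 when the node does not qualify. */
    static int denseSlots(final char[] sortedEdges, final int maxExpandedIndex) {
        if (sortedEdges.length == 0 || maxExpandedIndex <= 0) {
            return 0;
        }
        final int span = sortedEdges[sortedEdges.length - 1] - sortedEdges[0];
        return span <= maxExpandedIndex ? span + 1 : 0;
    }

    /** Slots that would stay empty because no edge uses that code point. */
    static int unusedSlots(final char[] sortedEdges, final int maxExpandedIndex) {
        final int slots = denseSlots(sortedEdges, maxExpandedIndex);
        return slots == 0 ? 0 : slots - sortedEdges.length;
    }

    public static void main(final String[] arguments) {
        final char[] edges = {'a', 'm', 'z'};
        System.out.println(denseSlots(edges, 512) + " " + unusedSlots(edges, 512)); // 26 23
        System.out.println(denseSlots(edges, 16) + " " + unusedSlots(edges, 16));   // 0 0
    }
}
```

A node with three edges spread over `a`..`z` pays for 26 slots under the default threshold; lowering the threshold below the span removes that cost and sends the node to the fallback path.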
## Persistence and loading model
This threshold is **not** stored in `TrieMetadata`.
- The binary format stores only trie payload and semantic metadata (`reduction`, `traversal`,
case/diacritic settings, and stream version).
- `maxExpandedIndex` is chosen when materializing nodes in memory.
- You can therefore keep one persisted artifact and load it with different in-memory
trade-offs depending on deployment constraints.
## Default
- `FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX == 512`
- `CompiledNode.DEFAULT_MAX_EXPANDED_INDEX == 512`
These are practical defaults for mixed-language text and Latin-like scripts where edge labels
often cluster.
## Tune during build (writable phase)
Use the full `FrequencyTrie.Builder` constructor when you are compiling from source data.
The builder threshold is applied while freezing reduced nodes into the immutable form.
```java
import org.egothor.stemmer.CaseProcessingMode;
import org.egothor.stemmer.DiacriticProcessingMode;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.ReductionSettings;
import org.egothor.stemmer.WordTraversalDirection;

final ReductionSettings settings = ReductionSettings.withDefaults(
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

final FrequencyTrie.Builder<String> fastBuilder =
        new FrequencyTrie.Builder<>(String[]::new,
                settings,
                WordTraversalDirection.BACKWARD,
                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
                DiacriticProcessingMode.AS_IS,
                1024); // prefer lookup speed

// ... put(...) ...
final FrequencyTrie<String> trie = fastBuilder.build();
```
Use `0` (dense tables disabled) or a smaller value such as `256` to reduce memory use when building large tries.
```java
final FrequencyTrie.Builder<String> compactBuilder =
        new FrequencyTrie.Builder<>(String[]::new,
                settings,
                WordTraversalDirection.BACKWARD,
                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
                DiacriticProcessingMode.AS_IS,
                256); // lower memory profile
```
## Tune when loading a binary artifact (runtime phase)
At artifact load time, you can tune the same trade-off independently of persisted metadata.
```java
import java.nio.file.Path;
import org.egothor.stemmer.StemmerPatchTrieLoader;

var defaultLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"));
var fastLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"), 1024);
var compactLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"), 0);
```
You can also set the threshold directly with `FrequencyTrie.readFrom(...)` when reading streams:
```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import org.egothor.stemmer.FrequencyTrie;

public final class StreamLoadExample {

    private StreamLoadExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        try (InputStream fileInput = Files.newInputStream(Path.of("stemmers", "english.radixor.gz"));
                GZIPInputStream gzip = new GZIPInputStream(fileInput);
                DataInputStream dataInput = new DataInputStream(gzip)) {
            final FrequencyTrie<String> compactOnLoad = FrequencyTrie.readFrom(
                    dataInput,
                    String[]::new,
                    input -> input.readUTF(),
                    256);
        }
    }
}
```
Note: the string codec is intentionally inline in this snippet to keep it self-contained.
## Practical guidance
- Start with the default (`512`) in production and profile before changing it.
- Use `0` when memory is the priority and query throughput is not the bottleneck.
- Use values around `1024` for workloads dominated by compact alphabets and very hot lookups.
Trade-off expectation:
- increasing `maxExpandedIndex` improves lookup speed when edges tend to occupy short spans,
- decreasing it reduces per-node auxiliary memory in dense-span nodes.

@@ -87,6 +87,43 @@ public final class LoadBinaryExample {
The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression. It includes persisted `TrieMetadata`, so lookup after loading uses the traversal, case-processing, diacritic-processing, and reduction settings captured when the trie was compiled.
## Tune child lookup density when loading binaries
To optimize hot-path latency, you can tune direct child indexing by passing `maxExpandedIndex`
at load time. This does not change persisted metadata, only the materialized in-memory form.
```java
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class LoadBinaryWithDenseLookupExample {

    private LoadBinaryWithDenseLookupExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        final FrequencyTrie<String> balanced = StemmerPatchTrieLoader.loadBinary(
                Path.of("stemmers", "english.radixor.gz"));
        final FrequencyTrie<String> fast = StemmerPatchTrieLoader.loadBinary(
                Path.of("stemmers", "english.radixor.gz"),
                1024);
        final FrequencyTrie<String> compact = StemmerPatchTrieLoader.loadBinary(
                Path.of("stemmers", "english.radixor.gz"),
                0);
    }
}
```
Negative values fall back to `FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX`.
[Lookup Edge Optimization](lookup-edge-optimization.md) describes the trade-off in detail and includes examples for build-time tuning.
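The handling of negative, zero, and positive values can be summarized in a short sketch. The class and method names here are hypothetical and the constant simply mirrors the documented default of `512`; how the library performs this normalization internally is an assumption.

```java
/** Hypothetical sketch of how a requested threshold might be normalized. */
final class ThresholdResolution {

    /** Mirrors the documented FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX value. */
    static final int DEFAULT_MAX_EXPANDED_INDEX = 512;

    /** Negative requests use the default; zero and positive values are kept as given. */
    static int resolve(final int requested) {
        return requested < 0 ? DEFAULT_MAX_EXPANDED_INDEX : requested;
    }

    /** Dense tables are only considered when the effective threshold is positive. */
    static boolean denseEnabled(final int requested) {
        return resolve(requested) > 0;
    }

    public static void main(final String[] arguments) {
        System.out.println(resolve(-1));   // 512
        System.out.println(resolve(0));    // 0
        System.out.println(resolve(1024)); // 1024
    }
}
```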
## Build directly with a mutable builder
A `FrequencyTrie.Builder<V>` accepts repeated `put(key, value)` calls and compiles the final read-only trie through `build()`. Compilation performs bottom-up reduction and produces the compact immutable runtime representation.

@@ -25,6 +25,7 @@ This is why Radixor can generalize beyond explicitly listed forms and why compil
The programmatic API is easier to understand when split by developer task:
- [Loading and Building Stemmers](programmatic-loading-and-building.md) explains how to acquire a compiled stemmer from bundled resources, textual dictionaries, binary artifacts, or direct builder usage.
- [Lookup Edge Optimization](lookup-edge-optimization.md) explains dense child lookup tuning and the speed/memory trade-off when materializing compiled tries.
- [Querying and Ambiguity Handling](programmatic-querying-and-ambiguity.md) explains `get(...)`, `getAll(...)`, `getEntries(...)`, patch application, and the practical meaning of reduction modes.
- [Extending and Persisting Compiled Tries](programmatic-extending-and-persistence.md) explains how to reopen compiled tries, add new lexical data, rebuild them, and store them as binary artifacts.

@@ -58,6 +58,27 @@ A deterministic system is easier to test, easier to reason about, and safer to i
The project is intended to maintain very high confidence in both core correctness and behavioral stability.
The recommended execution strategy is defined by the tagged test profiles in [Test taxonomy and execution filtering](test-taxonomy-and-filtering.md). In practice, teams can execute profile tasks directly:
- `./gradlew ciSmoke`: fast local/PR safety checks (`unit`, excluding `slow`; additionally excludes
`CompileIntegrationTest` as a defensive safeguard).
- `./gradlew ciSlow`: enterprise heavy gate for all tests marked with `slow` (typically
  production dictionary and large corpus verification). Use it for scheduled or manual
  hardening gates, not in the standard release build.
- `./gradlew ciCore`: behavioral coverage of trie and frequency-trie paths (`unit` + `property` where applicable)
- `./gradlew ciIntegration`: pipeline and CLI integration path checks
- `./gradlew ciCompat`: compatibility and regression verification for persisted artifacts
- `./gradlew ciRelease`: full non-slow suite for release-confidence runs (all test tags except `slow`,
plus explicit name-based exclusion of `CompileIntegrationTest*` and
`StemmerPatchTrieLoaderTest$BundledDictionaryTests*` as additional guardrails)
- `./gradlew ciNightly`: extended fuzz profile for robustness hardening
- `./gradlew ci`: umbrella profile depending on smoke/core/integration/compat
## Test taxonomy and execution filtering
The full tag taxonomy and executable filter examples are documented in
[Test taxonomy and execution filtering](test-taxonomy-and-filtering.md).
### Structural coverage
High code coverage is treated as a useful signal, but not as a sufficient goal on its own. Coverage is valuable only when the covered scenarios actually pressure the implementation in meaningful ways.

@@ -67,6 +67,36 @@ public final class LoadBinaryStemmerExample {
}
```
You can tune in-memory child lookup density at load time without changing the artifact:
```java
import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.StemmerPatchTrieLoader;

public final class LoadBinaryStemmerExampleTuned {

    private LoadBinaryStemmerExampleTuned() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        final FrequencyTrie<String> fast = StemmerPatchTrieLoader.loadBinary(
                Path.of("stemmers", "english.radixor.gz"),
                1024);
        final FrequencyTrie<String> compact = StemmerPatchTrieLoader.loadBinary(
                Path.of("stemmers", "english.radixor.gz"),
                128);
        System.out.println("fast=" + fast.size() + ", compact=" + compact.size());
    }
}
```
For the trade-off details, see [Lookup Edge Optimization](lookup-edge-optimization.md).
### Build or extend a stemmer from dictionary data
Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The input may be plain UTF-8 text or GZip-compressed UTF-8 text when loaded from a filesystem path. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace around columns, supports line remarks introduced by `#` or `//`, and skips dictionary items that contain embedded whitespace.

@@ -23,7 +23,7 @@ These reports are primarily useful when reviewing the published API surface and
These reports describe the outcome of core verification and static-analysis stages for the latest published build:
- [Unit test report](https://leogalambos.github.io/Radixor/builds/latest/test/)
- [Release verification test report (ciRelease)](https://leogalambos.github.io/Radixor/builds/latest/test/)
- [PMD report](https://leogalambos.github.io/Radixor/builds/latest/pmd/main.html)
- [JaCoCo coverage report](https://leogalambos.github.io/Radixor/builds/latest/coverage/)
- [PIT mutation testing report](https://leogalambos.github.io/Radixor/builds/latest/pitest/)

@@ -0,0 +1,216 @@
# Test Tag Taxonomy and Execution Guide
Radixor uses JUnit tags as an explicit execution policy for its test suite.
The project uses three orthogonal axes:
1. **Scope** (how the test is executed in the pipeline)
2. **Domain** (where in the system it belongs)
3. **Intent** (what behavior it verifies)
## Canonical scope tags
| Tag | Description | Typical usage |
| --- | --- | --- |
| `unit` | Fast, deterministic tests that exercise a specific class or behavior without external processes. | Default developer feedback; should stay near-zero flakiness and low run time. |
| `integration` | Tests that span multiple components or end-to-end flows of the public pipeline. | Parser/loader/CLI/IO integration checks and multi-step compile-then-load validations. |
| `property` | Property-based tests with generator-driven coverage for invariants. | Semantics-preserving laws and edge-case exploration beyond curated fixtures. |
| `fuzz` | Randomized stress checks with bounded runtime. | Heavier probabilistic verification of robustness and reduction invariants. |
| `compat` | Backward/forward compatibility and reproducibility checks for persisted artifacts. | Artifact fingerprints, deterministic rebuild, and regression fixtures. |
| `slow` | Long-running or expensive tests that should not execute in every fast gate. | Heavy fuzz/property budgets or high-duration integration checks. |
## Canonical domain tags
| Tag | Description | Typical usage |
| --- | --- | --- |
| `core` | Core algorithm and foundational platform behavior. | Traversal direction, base data structures, low-level helpers. |
| `trie` | All mutable/compiled trie behaviors and traversal internals. | Lookup path selection, node shape, child representation, subtree behavior. |
| `frequency-trie` | Algorithms and corner cases specific to frequency-aware trie logic. | Ranking, weighted reductions, persistence of weighted nodes. |
| `stemmer` | End-user stemming pipeline semantics. | Parse-encode-apply flows and output invariants. |
| `patch` | Patch encoding, decoding, and application semantics. | `PatchCommandEncoder` behavior and related compatibility contracts. |
| `io` | Input/output and resource loading boundaries. | Filesystem readers, streams, and stream lifecycle handling. |
| `serialization` | Binary persistence contract of compiled artifacts. | Versioned format reads/writes and checksum/consistency checks. |
| `parser` | Dictionary and metadata parsing concerns. | Dictionary input parsing and malformed-source rejection. |
| `cli` | Command-line entrypoint and command orchestration behavior. | Compile CLI integration and CLI argument validation. |
| `metadata` | Trie metadata semantics, compatibility fields, and schema expectations. | Version flags, structural properties, and metadata round-trips. |
| `compile` | Compile-time pipeline and build-oriented behavior. | Building, reduction-mode behavior, and compiled artifact generation. |
| `diacritic` | Unicode diacritic normalization and stripping behavior. | Accent-removal correctness and locale-safe normalization checks. |
## Canonical intent tags
| Tag | Description | Typical usage |
| --- | --- | --- |
| `construction` | Tests around construction and assembly of runtime structures. | Builders, loaders, and compile-time object construction contracts. |
| `lookup` | Read behavior and retrieval semantics. | `get()`, `getAll()`, traversal and missing-key behavior. |
| `persistence` | Storage lifecycle semantics. | Serialization/deserialization and round-trip correctness. |
| `reduction` | Reduction algorithm correctness and corner cases. | Dominance threshold, subtree deduplication, rank-preservation invariants. |
| `encoding` | Encoding transformation direction. | `PatchCommandEncoder.encode` and serialized command form generation. |
| `decoding` | Decoding/interpretation of persisted or runtime commands. | Optional consumers that parse and apply encoded command payloads. |
| `apply` | Patch application and transformation behavior. | Verifies that applied patches produce expected derived forms. |
| `normalization` | Canonicalization and cleanup behavior. | String normalization around case/shape and mirrored input paths. |
| `validation` | Input rejection and defensive checks. | Null/empty/invalid contracts and explicit failure conditions. |
| `regression` | Guard tests for behavior changes over time. | Known historical bugs and behavioral drift prevention. |
| `determinism` | Repeatable results under fixed input and settings. | Compile determinism, stable ordering, and artifact reproducibility. |
| `error-handling` | Exception surface and robustness expectations. | Recovery/failure modes and diagnostics quality. |
## Class-level rules
1. Every test class has **exactly one** scope tag.
2. Every test class has at least one domain tag.
3. Additional tags describe intent and may be used on classes or nested tests.
4. For each test class, intent tags should reflect the primary behavior under test, not historical naming conventions.
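Rules 1 and 2 are mechanical enough to check automatically. The following is a minimal sketch of such a check in plain Java, using the tag names from the tables above. `TagRuleCheck` is a hypothetical helper, not part of the project. It also assumes that `slow` acts as a modifier that may accompany a primary scope tag, as the `unit` excluding `slow` filter of `ciSmoke` suggests; if the project counts `slow` toward the exactly-one rule, add it to `PRIMARY_SCOPE_TAGS`.

```java
import java.util.Set;

/** Hypothetical validator for the class-level tagging rules. */
final class TagRuleCheck {

    /** Scope tags counted by rule 1. Assumption: `slow` is a modifier and is not counted here. */
    static final Set<String> PRIMARY_SCOPE_TAGS =
            Set.of("unit", "integration", "property", "fuzz", "compat");

    /** Domain tags from the canonical domain table. */
    static final Set<String> DOMAIN_TAGS = Set.of(
            "core", "trie", "frequency-trie", "stemmer", "patch", "io",
            "serialization", "parser", "cli", "metadata", "compile", "diacritic");

    /** True when the class carries exactly one scope tag and at least one domain tag. */
    static boolean satisfiesClassRules(final Set<String> classTags) {
        final long scopeCount = classTags.stream().filter(PRIMARY_SCOPE_TAGS::contains).count();
        final boolean hasDomain = classTags.stream().anyMatch(DOMAIN_TAGS::contains);
        return scopeCount == 1 && hasDomain;
    }

    public static void main(final String[] arguments) {
        System.out.println(satisfiesClassRules(Set.of("unit", "trie", "lookup")));      // true
        System.out.println(satisfiesClassRules(Set.of("unit", "integration", "trie"))); // false
    }
}
```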
## Governance and execution policy
The following rules are used to keep the suite auditable and stable:
| Rule | Required state | Why |
| --- | --- | --- |
| Scope discipline | Exactly one scope tag per class. | Prevents accidental promotion of integration-only behavior into fast unit runs. |
| Coverage breadth | At least one domain tag per class. | Ensures tests can be grouped by subsystem for targeted review. |
| Intent specificity | Use at least one intent tag when behavior is non-trivial. | Makes failure triage faster and profile composition explicit. |
| Runtime policy | Never run `slow` tests in the default `unit` profile unless explicitly required. | Preserves turnaround for PR feedback while preserving deep checks. |
| Change risk | Any persistence or compatibility-affecting change must include `compat` in validation. | Protects long-lived binary artifact contracts. |
| Mutation resistance | `fuzz`/`property` sets should be gated to dedicated profiles. | Limits flakiness exposure and controls CI resource cost. |
## Suggested CI profiles
These are recommended launch profiles for local and CI usage and are also exposed as Gradle tasks:
- **Profile: `ci-smoke` (fast feedback):**
```
./gradlew test -DincludeTags=unit -DexcludeTags=slow
./gradlew ciSmoke
```
`ciSmoke` also excludes `org.egothor.stemmer.CompileIntegrationTest*` at the test-name filter
level as a defensive fallback in case of future tag drift.
`ciRelease` additionally excludes
`org.egothor.stemmer.StemmerPatchTrieLoaderTest$BundledDictionaryTests*` at the filter level.
- **Profile: `ci-core` (core behavioral coverage):**
```
./gradlew test -DincludeTags=unit,trie,frequency-trie,property
./gradlew ciCore
```
- **Profile: `ci-integration` (pipeline correctness):**
```
./gradlew test -DincludeTags=integration
./gradlew ciIntegration
```
- **Profile: `ci-slow` (explicit heavy validation):**
```
./gradlew ciSlow
```
- **Profile: `ci-compat` (artifact stability):**
```
./gradlew test -DincludeTags=compat,regression
./gradlew ciCompat
```
- **Profile: `ci-release` (strong confidence before release):**
```
./gradlew test -DexcludeTags=slow
./gradlew ciRelease
```
`ciRelease` is non-slow by policy and uses the same defensive name-based exclusion for
`org.egothor.stemmer.CompileIntegrationTest*` and
`org.egothor.stemmer.StemmerPatchTrieLoaderTest$BundledDictionaryTests*` in addition to tag filtering.
- **Profile: `ci-nightly` (extended hardening):**
```
./gradlew test -DincludeTags=fuzz
./gradlew ciNightly
```
- **Profile: `ci` (enterprise umbrella):**
```
./gradlew ci
```
`ci` and `ciRelease` intentionally do **not** include `slow` paths. Run `ciSlow` explicitly for production-dictionary stress and long-running corpus checks.
## Practical examples
All examples use Gradle with JUnit Platform integration:
- Only unit tests:
```
./gradlew test -DincludeTags=unit
```
- Integration tests only:
```
./gradlew test -DincludeTags=integration
```
- Only trie subsystem tests:
```
./gradlew test -DincludeTags=trie
```
- Deterministic fuzz checks:
```
./gradlew test -DincludeTags=fuzz
```
- Property tests:
```
./gradlew test -DincludeTags=property
```
- Stemmer + patch command behavior:
```
./gradlew test -DincludeTags=stemmer,patch
```
- Compatibility artifacts and regression checks:
```
./gradlew test -DincludeTags=compat
```
- Keep regression suite and remove long-running cases:
```
./gradlew test -DincludeTags=regression -DexcludeTags=slow
```
- Trie + patch behavior:
```
./gradlew test -DincludeTags=trie,patch
```
- Deterministic compatibility and persistence checks:
```
./gradlew test -DincludeTags=compat,determinism,serialization
```
## Notes
- `-DincludeTags` and `-DexcludeTags` are interpreted by Gradle task filtering and forwarded into
JUnit tag filtering.
- Class-name filtering is also available via Gradle test selectors where needed
(for example, `--tests '*CompileTest'`, quoted so the shell does not expand the wildcard),
but tag filtering remains the default execution strategy.
- `-DincludeTags` supports comma-separated literal tags. When you need a single exact tag with special
characters, quote the argument for the shell.
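When several tags are combined, it helps to know how the filter decides. The sketch below models the usual JUnit Platform behavior: a test is dropped if it carries any excluded tag, and otherwise runs if no include list is given or if it carries at least one included tag. It assumes the project forwards the comma-separated values unchanged. `TagFilterSketch` and its methods are hypothetical names, not project code.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical model of include/exclude tag filtering. */
final class TagFilterSketch {

    /** Splits a comma-separated tag list, trimming whitespace and dropping empty entries. */
    static Set<String> parse(final String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return Set.of();
        }
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(tag -> !tag.isEmpty())
                .collect(Collectors.toSet());
    }

    /** Exclusion wins; otherwise any included tag (or an empty include list) selects the test. */
    static boolean selected(final Set<String> testTags, final String includeCsv, final String excludeCsv) {
        final Set<String> include = parse(includeCsv);
        final Set<String> exclude = parse(excludeCsv);
        if (testTags.stream().anyMatch(exclude::contains)) {
            return false;
        }
        return include.isEmpty() || testTags.stream().anyMatch(include::contains);
    }

    public static void main(final String[] arguments) {
        System.out.println(selected(Set.of("unit", "trie"), "unit", "slow")); // true
        System.out.println(selected(Set.of("unit", "slow"), "unit", "slow")); // false
    }
}
```

Under this model, `-DincludeTags=stemmer,patch` selects tests tagged with either `stemmer` or `patch`, not only those carrying both.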