feat: implement dense-child optimized trie lookup and enterprise test/CI profile hardening

2026-05-16 03:24:07 +02:00
parent 50c3ab3432
commit dadab5514e
44 changed files with 2052 additions and 294 deletions
--- a/docs/lookup-edge-optimization.md
+++ b/docs/lookup-edge-optimization.md
@@ -0,0 +1,193 @@
+# Lookup Edge Optimization
+
+Compiled trie nodes (`CompiledNode`) resolve child edges with one of three lookup strategies:
+
+1. dense array direct lookup,
+2. linear scan for very small child counts,
+3. binary search over sorted edge labels.
+
+This page explains the dense path, what `maxExpandedIndex` controls, and how to tune it.
+
+## Runtime model of one node
+
+For a node with sorted edge labels `char[] edges`, the implementation can materialize an
+index-aligned dense table when labels occupy a small compact code-point interval:
+
+```text
+span = maxEdge - minEdge
+use dense table iff (span <= maxExpandedIndex) and (maxExpandedIndex > 0)
+```
+
+When dense lookup is used, lookup is constant-time indexing:
+
+```text
+denseIndex = requestedEdge - minEdge
+return denseChildren[denseIndex]  // or null if outside interval
+```
+
+When dense lookup is not active (interval is too wide or the configured
+`maxExpandedIndex` is `0`), `CompiledNode` still chooses between two fallback
+strategies:
+
+- **linear scan** for very small child counts (`4` or fewer children),
+- **binary search** for larger child counts.
+
+The fallback strategy is therefore selected by child count, not by code-point span.
+Linear scan is used whenever a node has only a few edges, even if those edges are
+spread across very distant code points.
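+
+The selection rules above can be sketched in a few lines. The class below is a simplified
+illustration, not the actual `CompiledNode` code: the name `EdgeLookupSketch`, the `String`
+child payload, and the `LINEAR_SCAN_LIMIT` constant are assumptions made for this example,
+and it expects a non-empty, sorted array of distinct labels.
+
+```java
+import java.util.Arrays;
+
+/** Hypothetical sketch of per-node strategy selection; not the library implementation. */
+final class EdgeLookupSketch {
+
+    /** Assumed cutoff, taken from the "4 or fewer children" rule described above. */
+    static final int LINEAR_SCAN_LIMIT = 4;
+
+    enum Strategy { DENSE, LINEAR_SCAN, BINARY_SEARCH }
+
+    private final char[] edges;      // sorted edge labels
+    private final String[] children; // children[i] belongs to edges[i]
+    private final String[] dense;    // index-aligned table, or null when not dense
+    private final Strategy strategy;
+
+    EdgeLookupSketch(final char[] edges, final String[] children, final int maxExpandedIndex) {
+        this.edges = edges.clone();
+        this.children = children.clone();
+        final int span = edges[edges.length - 1] - edges[0];
+        if (maxExpandedIndex > 0 && span <= maxExpandedIndex) {
+            // Dense path: one slot per code point in [minEdge, maxEdge].
+            this.strategy = Strategy.DENSE;
+            this.dense = new String[span + 1];
+            for (int i = 0; i < edges.length; i++) {
+                this.dense[edges[i] - edges[0]] = children[i];
+            }
+        } else {
+            // Fallback path: chosen by child count only.
+            this.strategy = edges.length <= LINEAR_SCAN_LIMIT
+                    ? Strategy.LINEAR_SCAN
+                    : Strategy.BINARY_SEARCH;
+            this.dense = null;
+        }
+    }
+
+    Strategy strategy() {
+        return strategy;
+    }
+
+    /** Returns the child stored under the given edge, or null when there is none. */
+    String child(final char edge) {
+        switch (strategy) {
+            case DENSE: {
+                final int denseIndex = edge - edges[0];
+                return denseIndex >= 0 && denseIndex < dense.length ? dense[denseIndex] : null;
+            }
+            case LINEAR_SCAN:
+                for (int i = 0; i < edges.length; i++) {
+                    if (edges[i] == edge) {
+                        return children[i];
+                    }
+                }
+                return null;
+            default: {
+                final int position = Arrays.binarySearch(edges, edge);
+                return position >= 0 ? children[position] : null;
+            }
+        }
+    }
+
+    public static void main(final String[] arguments) {
+        final EdgeLookupSketch node = new EdgeLookupSketch(
+                new char[] {'a', 'b', 'e'}, new String[] {"A", "B", "E"}, 512);
+        System.out.println(node.strategy() + " -> " + node.child('b'));
+    }
+}
+```
+
+Passing `0` as the threshold forces every node in this sketch onto one of the two fallback
+paths, which mirrors the "dense lookup is not active" case described above.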
+
+### Example: few edges, wide Unicode span
+
+```text
+edges = ['a', '中', '你']
+edge count = 3
+minEdge = 'a' (U+0061)
+maxEdge = '你' (U+4F60)
+span = 20223
+```
+
+- If `maxExpandedIndex = 512`, dense indexing is not used because `span > maxExpandedIndex`.
+- Because `edge count = 3` (<= 4), lookup falls back to a tiny linear scan of the
+  three labels.
+- This is exactly the case where the linear-scan cutoff helps even though the interval is too wide for a dense table.
+
+Dense lookup is not limited to Latin text: what matters is interval width in Unicode
+code points, not script name. Keys drawn from a compact block, such as the Arabic range,
+can still benefit from dense lookups when they stay in a tight code-point interval.
+
+## Why this is configurable
+
+`maxExpandedIndex` is purely a performance/memory trade-off:
+
+- higher value:
+  - wider intervals qualify for dense tables,
+  - more nodes get constant-time child lookup,
+  - more memory for dense tables in qualifying nodes.
+- lower value (or `0`):
+  - fewer dense tables are allocated,
+  - fewer nodes take the constant-time path,
+  - lower materialization memory.
+
+The value never changes lookup semantics. It only changes the in-memory structure shape.
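+
+To get a feel for the memory side of the trade-off, the following sketch estimates how many
+slots a dense table would need for one node. It assumes the table holds one slot per code point
+in `[minEdge, maxEdge]` (that is, `span + 1` slots), which follows from the indexing rule above
+but is not confirmed against the implementation; `DenseTableCostSketch` and its methods are
+hypothetical names.
+
+```java
+/** Hypothetical estimate of dense-table size for a single node. */
+final class DenseTableCostSketch {
+
+    private DenseTableCostSketch() {
+        throw new AssertionError("No instances.");
+    }
+
+    /** Slots needed for sorted, non-empty labels, or 0 when the node does not qualify. */
+    static int denseSlots(final char[] sortedEdges, final int maxExpandedIndex) {
+        final int span = sortedEdges[sortedEdges.length - 1] - sortedEdges[0];
+        final boolean qualifies = maxExpandedIndex > 0 && span <= maxExpandedIndex;
+        return qualifies ? span + 1 : 0;
+    }
+
+    /** Slots that would stay empty, assuming all labels are distinct. */
+    static int unusedSlots(final char[] sortedEdges, final int maxExpandedIndex) {
+        final int slots = denseSlots(sortedEdges, maxExpandedIndex);
+        return slots == 0 ? 0 : slots - sortedEdges.length;
+    }
+
+    public static void main(final String[] arguments) {
+        final char[] sparse = {'a', 'z'};
+        System.out.println("slots=" + denseSlots(sparse, 512)
+                + ", unused=" + unusedSlots(sparse, 512));
+    }
+}
+```
+
+For the two labels `'a'` and `'z'` the sketch reports 26 slots, 24 of them unused, which is the
+kind of per-node overhead a higher threshold accepts in exchange for direct indexing.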
+
+## Persistence and loading model
+
+This threshold is **not** stored in `TrieMetadata`.
+
+- The binary format stores only trie payload and semantic metadata (`reduction`, `traversal`,
+  case/diacritic settings, and stream version).
+- `maxExpandedIndex` is chosen when materializing nodes in memory.
+- You can therefore keep one persisted artifact and load it with different in-memory
+  trade-offs depending on deployment constraints.
+
+## Default
+
+- `FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX == 512`
+- `CompiledNode.DEFAULT_MAX_EXPANDED_INDEX == 512`
+
+These are practical defaults for mixed-language text and Latin-like scripts where edge labels
+often cluster.
+
+## Tune during build (writable phase)
+
+Use the full `FrequencyTrie.Builder` constructor when you are compiling from source data.
+The builder threshold is applied while freezing reduced nodes into the immutable form.
+
+```java
+import org.egothor.stemmer.CaseProcessingMode;
+import org.egothor.stemmer.DiacriticProcessingMode;
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.ReductionSettings;
+import org.egothor.stemmer.WordTraversalDirection;
+
+final ReductionSettings settings = ReductionSettings.withDefaults(
+        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+
+final FrequencyTrie.Builder<String> fastBuilder =
+        new FrequencyTrie.Builder<>(String[]::new,
+                settings,
+                WordTraversalDirection.BACKWARD,
+                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
+                DiacriticProcessingMode.AS_IS,
+                1024); // prefer lookup speed
+
+// ... put(...) ...
+final FrequencyTrie<String> trie = fastBuilder.build();
+```
+
+Use a lower value such as `256`, or `0` to disable dense tables entirely, to reduce memory when building large tries.
+
+```java
+final FrequencyTrie.Builder<String> compactBuilder =
+        new FrequencyTrie.Builder<>(String[]::new,
+                settings,
+                WordTraversalDirection.BACKWARD,
+                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
+                DiacriticProcessingMode.AS_IS,
+                256); // lower memory profile
+```
+
+## Tune when loading a binary artifact (runtime phase)
+
+At artifact load time, you can tune the same trade-off independently of persisted metadata.
+
+```java
+import java.nio.file.Path;
+
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+var defaultLookup = StemmerPatchTrieLoader.loadBinary(
+        Path.of("stemmers", "english.radixor.gz"));
+
+var fastLookup = StemmerPatchTrieLoader.loadBinary(
+        Path.of("stemmers", "english.radixor.gz"), 1024);
+
+var compactLookup = StemmerPatchTrieLoader.loadBinary(
+        Path.of("stemmers", "english.radixor.gz"), 0);
+```
+
+You can also set the threshold directly with `FrequencyTrie.readFrom(...)` when reading streams:
+
+```java
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.GZIPInputStream;
+
+import org.egothor.stemmer.FrequencyTrie;
+
+public final class StreamLoadExample {
+
+    private StreamLoadExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        try (InputStream fileInput = Files.newInputStream(Path.of("stemmers", "english.radixor.gz"));
+                GZIPInputStream gzip = new GZIPInputStream(fileInput);
+                DataInputStream dataInput = new DataInputStream(gzip)) {
+            final FrequencyTrie<String> compactOnLoad = FrequencyTrie.readFrom(
+                    dataInput,
+                    String[]::new,
+                    input -> input.readUTF(),
+                    256);
+        }
+    }
+}
+```
+
+Note: the string codec is intentionally inline in this snippet to keep it self-contained.
+
+## Practical guidance
+
+- Start with the default (`512`) in production and profile before changing it.
+- Use `0` when memory is the priority and query throughput is not the bottleneck.
+- Use values around `1024` for workloads dominated by compact alphabets and very hot lookups.
+
+Trade-off expectation:
+
+- increasing `maxExpandedIndex` improves lookup speed when edges tend to occupy short spans,
+- decreasing it reduces per-node auxiliary memory in dense-span nodes.
--- a/docs/programmatic-loading-and-building.md
+++ b/docs/programmatic-loading-and-building.md
@@ -87,6 +87,43 @@ public final class LoadBinaryExample {

 The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression. It includes persisted `TrieMetadata`, so lookup after loading uses the traversal, case-processing, diacritic-processing, and reduction settings captured when the trie was compiled.

+## Tune child lookup density when loading binaries
+
+To optimize hot-path latency, you can tune direct child indexing by passing `maxExpandedIndex`
+at load time. This does not change persisted metadata, only the materialized in-memory form.
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class LoadBinaryWithDenseLookupExample {
+
+    private LoadBinaryWithDenseLookupExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final FrequencyTrie<String> balanced = StemmerPatchTrieLoader.loadBinary(
+                Path.of("stemmers", "english.radixor.gz"));
+
+        final FrequencyTrie<String> fast = StemmerPatchTrieLoader.loadBinary(
+                Path.of("stemmers", "english.radixor.gz"),
+                1024);
+
+        final FrequencyTrie<String> compact = StemmerPatchTrieLoader.loadBinary(
+                Path.of("stemmers", "english.radixor.gz"),
+                0);
+    }
+}
+```
+
+Negative values fall back to `FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX`.
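+
+The handling of negative thresholds can be pictured with a small helper. This is only an
+illustration of the documented behavior: `ThresholdResolutionSketch` and `resolve(...)` are
+hypothetical names, and the constant simply mirrors the documented default of `512` rather
+than referencing the library.
+
+```java
+/** Hypothetical illustration of how a requested threshold might be normalized. */
+final class ThresholdResolutionSketch {
+
+    /** Mirrors the documented default; an assumption local to this sketch. */
+    static final int DEFAULT_MAX_EXPANDED_INDEX = 512;
+
+    private ThresholdResolutionSketch() {
+        throw new AssertionError("No instances.");
+    }
+
+    /** Negative requests use the default; zero and positive values are kept as given. */
+    static int resolve(final int requested) {
+        return requested < 0 ? DEFAULT_MAX_EXPANDED_INDEX : requested;
+    }
+
+    public static void main(final String[] arguments) {
+        System.out.println(resolve(-1) + " " + resolve(0) + " " + resolve(1024));
+    }
+}
+```
+
+Note that `0` is not treated as "unset" in this sketch: it stays `0` and keeps dense tables disabled.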
+
+[Lookup Edge Optimization](lookup-edge-optimization.md) describes the trade-off in detail and also includes examples for build-time tuning.
+
 ## Build directly with a mutable builder

 A `FrequencyTrie.Builder<V>` accepts repeated `put(key, value)` calls and compiles the final read-only trie through `build()`. Compilation performs bottom-up reduction and produces the compact immutable runtime representation.
--- a/docs/programmatic-usage.md
+++ b/docs/programmatic-usage.md
@@ -25,6 +25,7 @@ This is why Radixor can generalize beyond explicitly listed forms and why compil
 The programmatic API is easier to understand when split by developer task:

 - [Loading and Building Stemmers](programmatic-loading-and-building.md) explains how to acquire a compiled stemmer from bundled resources, textual dictionaries, binary artifacts, or direct builder usage.
+- [Lookup Edge Optimization](lookup-edge-optimization.md) explains dense child lookup tuning and the speed/memory trade-off when materializing compiled tries.
 - [Querying and Ambiguity Handling](programmatic-querying-and-ambiguity.md) explains `get(...)`, `getAll(...)`, `getEntries(...)`, patch application, and the practical meaning of reduction modes.
 - [Extending and Persisting Compiled Tries](programmatic-extending-and-persistence.md) explains how to reopen compiled tries, add new lexical data, rebuild them, and store them as binary artifacts.

--- a/docs/quality-and-operations.md
+++ b/docs/quality-and-operations.md
@@ -58,6 +58,27 @@ A deterministic system is easier to test, easier to reason about, and safer to i

 The project is intended to maintain very high confidence in both core correctness and behavioral stability.

+The recommended execution strategy is defined by the tagged test profiles in [Test taxonomy and execution filtering](test-taxonomy-and-filtering.md). In practice, teams can execute profile tasks directly:
+
+- `./gradlew ciSmoke`: fast local/PR safety checks (`unit`, excluding `slow`; additionally excludes
+  `CompileIntegrationTest` as a defensive safeguard).
+- `./gradlew ciSlow`: enterprise heavy gate for all tests marked with `slow` (typically
+  production dictionary and large corpus verification). This should be used for scheduled/manual
+  hardening gates, not in the standard release build.
+- `./gradlew ciCore`: behavioral coverage of trie and frequency-trie paths (`unit` + `property` where applicable)
+- `./gradlew ciIntegration`: pipeline and CLI integration path checks
+- `./gradlew ciCompat`: compatibility and regression verification for persisted artifacts
+- `./gradlew ciRelease`: full non-slow suite for release-confidence runs (all test tags except `slow`,
+  plus explicit name-based exclusion of `CompileIntegrationTest*` and
+  `StemmerPatchTrieLoaderTest$BundledDictionaryTests*` as additional guardrails)
+- `./gradlew ciNightly`: extended fuzz profile for robustness hardening
+- `./gradlew ci`: umbrella profile depending on smoke/core/integration/compat
+
+## Test taxonomy and execution filtering
+
+The full tag taxonomy and executable filter examples are documented in
+[Test taxonomy and execution filtering](test-taxonomy-and-filtering.md).
+
 ### Structural coverage

 High code coverage is treated as a useful signal, but not as a sufficient goal on its own. Coverage is valuable only when the covered scenarios actually pressure the implementation in meaningful ways.
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -67,6 +67,36 @@ public final class LoadBinaryStemmerExample {
 }
 ```

+You can tune in-memory child lookup density at load time without changing the artifact:
+
+```java
+import java.io.IOException;
+import java.nio.file.Path;
+
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+public final class LoadBinaryStemmerExampleTuned {
+
+    private LoadBinaryStemmerExampleTuned() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        final FrequencyTrie<String> fast = StemmerPatchTrieLoader.loadBinary(
+                Path.of("stemmers", "english.radixor.gz"),
+                1024);
+        final FrequencyTrie<String> compact = StemmerPatchTrieLoader.loadBinary(
+                Path.of("stemmers", "english.radixor.gz"),
+                128);
+
+        System.out.println("fast=" + fast.size() + ", compact=" + compact.size());
+    }
+}
+```
+
+For the trade-off details, see [Lookup Edge Optimization](lookup-edge-optimization.md).
+
 ### Build or extend a stemmer from dictionary data

 Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The input may be plain UTF-8 text or GZip-compressed UTF-8 text when loaded from a filesystem path. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace around columns, supports line remarks introduced by `#` or `//`, and skips dictionary items that contain embedded whitespace.
--- a/docs/reports.md
+++ b/docs/reports.md
@@ -23,7 +23,7 @@ These reports are primarily useful when reviewing the published API surface and

 These reports describe the outcome of core verification and static-analysis stages for the latest published build:

- [Unit test report](https://leogalambos.github.io/Radixor/builds/latest/test/)
+- [Release verification test report (ciRelease)](https://leogalambos.github.io/Radixor/builds/latest/test/)
 - [PMD report](https://leogalambos.github.io/Radixor/builds/latest/pmd/main.html)
 - [JaCoCo coverage report](https://leogalambos.github.io/Radixor/builds/latest/coverage/)
 - [PIT mutation testing report](https://leogalambos.github.io/Radixor/builds/latest/pitest/)
--- a/docs/test-taxonomy-and-filtering.md
+++ b/docs/test-taxonomy-and-filtering.md
@@ -0,0 +1,216 @@
+# Test Tag Taxonomy and Execution Guide
+
+Radixor uses JUnit tags as an explicit execution policy for its test suite.
+
+The project uses three orthogonal axes:
+
+1. **Scope** (how the test is executed in the pipeline)
+2. **Domain** (where in the system it belongs)
+3. **Intent** (what behavior it verifies)
+
+## Canonical scope tags
+
+| Tag | Description | Typical usage |
+| --- | --- | --- |
+| `unit` | Fast, deterministic tests that exercise a specific class or behavior without external processes. | Default developer feedback; should stay near-zero flakiness and low run time. |
+| `integration` | Tests that span multiple components or end-to-end flows of the public pipeline. | Parser/loader/CLI/IO integration checks and multi-step compile-then-load validations. |
+| `property` | Property-based tests with generator-driven coverage for invariants. | Semantics-preserving laws and edge-case exploration beyond curated fixtures. |
+| `fuzz` | Randomized stress checks with bounded runtime. | Heavier probabilistic verification of robustness and reduction invariants. |
+| `compat` | Backward/forward compatibility and reproducibility checks for persisted artifacts. | Artifact fingerprints, deterministic rebuild, and regression fixtures. |
+| `slow` | Long-running or expensive tests that should not execute in every fast gate. | Heavy fuzz/property budgets or high-duration integration checks. |
+
+## Canonical domain tags
+
+| Tag | Description | Typical usage |
+| --- | --- | --- |
+| `core` | Core algorithm and foundational platform behavior. | Traversal direction, base data structures, low-level helpers. |
+| `trie` | All mutable/compiled trie behaviors and traversal internals. | Lookup path selection, node shape, child representation, subtree behavior. |
+| `frequency-trie` | Algorithms and corner cases specific to frequency-aware trie logic. | Ranking, weighted reductions, persistence of weighted nodes. |
+| `stemmer` | End-user stemming pipeline semantics. | Parse-encode-apply flows and output invariants. |
+| `patch` | Patch encoding, decoding, and application semantics. | `PatchCommandEncoder` behavior and related compatibility contracts. |
+| `io` | Input/output and resource loading boundaries. | Filesystem readers, streams, and stream lifecycle handling. |
+| `serialization` | Binary persistence contract of compiled artifacts. | Versioned format reads/writes and checksum/consistency checks. |
+| `parser` | Dictionary and metadata parsing concerns. | Dictionary input parsing and malformed-source rejection. |
+| `cli` | Command-line entrypoint and command orchestration behavior. | Compile CLI integration and CLI argument validation. |
+| `metadata` | Trie metadata semantics, compatibility fields, and schema expectations. | Version flags, structural properties, and metadata round-trips. |
+| `compile` | Compile-time pipeline and build-oriented behavior. | Building, reduction-mode behavior, and compiled artifact generation. |
+| `diacritic` | Unicode diacritic normalization and stripping behavior. | Accent-removal correctness and locale-safe normalization checks. |
+
+## Canonical intent tags
+
+| Tag | Description | Typical usage |
+| --- | --- | --- |
+| `construction` | Tests around construction and assembly of runtime structures. | Builders, loaders, and compile-time object construction contracts. |
+| `lookup` | Read behavior and retrieval semantics. | `get()`, `getAll()`, traversal and missing-key behavior. |
+| `persistence` | Storage lifecycle semantics. | Serialization/deserialization and round-trip correctness. |
+| `reduction` | Reduction algorithm correctness and corner cases. | Dominance threshold, subtree deduplication, rank-preservation invariants. |
+| `encoding` | Encoding transformation direction. | `PatchCommandEncoder.encode` and serialized command form generation. |
+| `decoding` | Decoding/interpretation of persisted or runtime commands. | Optional consumers that parse and apply encoded command payloads. |
+| `apply` | Patch application and transformation behavior. | Verifies that applied patches produce expected derived forms. |
+| `normalization` | Canonicalization and cleanup behavior. | String normalization around case/shape and mirrored input paths. |
+| `validation` | Input rejection and defensive checks. | Null/empty/invalid contracts and explicit failure conditions. |
+| `regression` | Guard tests for behavior changes over time. | Known historical bugs and behavioral drift prevention. |
+| `determinism` | Repeatable results under fixed input and settings. | Compile determinism, stable ordering, and artifact reproducibility. |
+| `error-handling` | Exception surface and robustness expectations. | Recovery/failure modes and diagnostics quality. |
+
+## Class-level rules
+
+1. Every test class has **exactly one** scope tag.
+2. Every test class has at least one domain tag.
+3. Additional tags describe intent and may be used on classes or nested tests.
+4. For each test class, intent tags should reflect the primary behavior under test, not historical naming conventions.
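+
+Rules 1 and 2 are mechanical enough to check automatically. The sketch below is one possible
+way to do that; `TagPolicySketch` and `violations(...)` are hypothetical names and not part of
+the project. The tag sets are copied from the scope and domain tables above, so `slow` is
+counted as a scope tag here.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+/** Hypothetical checker for the class-level tag rules; not part of the build. */
+final class TagPolicySketch {
+
+    static final Set<String> SCOPE_TAGS =
+            Set.of("unit", "integration", "property", "fuzz", "compat", "slow");
+
+    static final Set<String> DOMAIN_TAGS = Set.of(
+            "core", "trie", "frequency-trie", "stemmer", "patch", "io",
+            "serialization", "parser", "cli", "metadata", "compile", "diacritic");
+
+    private TagPolicySketch() {
+        throw new AssertionError("No instances.");
+    }
+
+    /** Returns one message per violated rule; an empty list means the tags conform. */
+    static List<String> violations(final Set<String> classTags) {
+        final List<String> problems = new ArrayList<>();
+        final long scopeCount = classTags.stream().filter(SCOPE_TAGS::contains).count();
+        if (scopeCount != 1) {
+            problems.add("expected exactly one scope tag, found " + scopeCount);
+        }
+        if (classTags.stream().noneMatch(DOMAIN_TAGS::contains)) {
+            problems.add("expected at least one domain tag");
+        }
+        return problems;
+    }
+
+    public static void main(final String[] arguments) {
+        System.out.println(violations(Set.of("unit", "trie", "lookup")));
+        System.out.println(violations(Set.of("unit", "integration", "lookup")));
+    }
+}
+```
+
+Intent tags such as `lookup` are ignored by the check, which matches rule 3: they are
+additional and optional.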
+
+## Governance and execution policy
+
+The following rules are used to keep the suite auditable and stable:
+
+| Rule | Required state | Why |
+| --- | --- | --- |
+| Scope discipline | Exactly one scope tag per class. | Prevents accidental promotion of integration-only behavior into fast unit runs. |
+| Coverage breadth | At least one domain tag per class. | Ensures tests can be grouped by subsystem for targeted review. |
+| Intent specificity | Use at least one intent tag when behavior is non-trivial. | Makes failure triage faster and profile composition explicit. |
+| Runtime policy | Never run `slow` tests in the default `unit` profile unless explicitly required. | Keeps PR feedback fast while retaining deep checks in dedicated profiles. |
+| Change risk | Any persistence or compatibility-affecting change must include `compat` in validation. | Protects long-lived binary artifact contracts. |
+| Mutation resistance | `fuzz`/`property` sets should be gated to dedicated profiles. | Limits flakiness exposure and controls CI resource cost. |
+
+## Suggested CI profiles
+
+These are recommended launch profiles for local and CI usage and are also exposed as Gradle tasks:
+
+- **Profile: `ci-smoke` (fast feedback):**
+
+```
+./gradlew test -DincludeTags=unit -DexcludeTags=slow
+./gradlew ciSmoke
+```
+
+`ciSmoke` also excludes `org.egothor.stemmer.CompileIntegrationTest*` at test-name filter level as a
+defensive fallback in case of future tag drift.
+
+- **Profile: `ci-core` (core behavioral coverage):**
+
+```
+./gradlew test -DincludeTags=unit,trie,frequency-trie,property
+./gradlew ciCore
+```
+
+- **Profile: `ci-integration` (pipeline correctness):**
+
+```
+./gradlew test -DincludeTags=integration
+./gradlew ciIntegration
+```
+
+- **Profile: `ci-slow` (explicit heavy validation):**
+
+```
+./gradlew ciSlow
+```
+
+- **Profile: `ci-compat` (artifact stability):**
+
+```
+./gradlew test -DincludeTags=compat,regression
+./gradlew ciCompat
+```
+
+- **Profile: `ci-release` (strong confidence before release):**
+
+```
+./gradlew test -DexcludeTags=slow
+./gradlew ciRelease
+```
+
+`ciRelease` is non-slow by policy. In addition to tag filtering, it applies defensive name-based
+exclusions for `org.egothor.stemmer.CompileIntegrationTest*` and
+`org.egothor.stemmer.StemmerPatchTrieLoaderTest$BundledDictionaryTests*`.
+
+- **Profile: `ci-nightly` (extended hardening):**
+
+```
+./gradlew test -DincludeTags=fuzz
+./gradlew ciNightly
+```
+
+- **Profile: `ci` (enterprise umbrella):**
+
+```
+./gradlew ci
+```
+
+`ci` and `ciRelease` intentionally do **not** include `slow` paths. Run `ciSlow` explicitly for production-dictionary stress and long-running corpus checks.
+
+## Practical examples
+
+All examples use Gradle with JUnit Platform integration:
+
+- Only unit tests:
+
+```
+./gradlew test -DincludeTags=unit
+```
+
+- Integration tests only:
+
+```
+./gradlew test -DincludeTags=integration
+```
+
+- Only trie subsystem tests:
+
+```
+./gradlew test -DincludeTags=trie
+```
+
+- Deterministic fuzz checks:
+
+```
+./gradlew test -DincludeTags=fuzz
+```
+
+- Property tests:
+
+```
+./gradlew test -DincludeTags=property
+```
+
+- Stemmer + patch command behavior:
+
+```
+./gradlew test -DincludeTags=stemmer,patch
+```
+
+- Compatibility artifacts and regression checks:
+
+```
+./gradlew test -DincludeTags=compat
+```
+
+- Keep regression suite and remove long-running cases:
+
+```
+./gradlew test -DincludeTags=regression -DexcludeTags=slow
+```
+
+- Trie + patch behavior:
+
+```
+./gradlew test -DincludeTags=trie,patch
+```
+
+- Deterministic compatibility and persistence checks:
+
+```
+./gradlew test -DincludeTags=compat,determinism,serialization
+```
+
+## Notes
+
+- `-DincludeTags` and `-DexcludeTags` are read by the Gradle test task configuration and
+  forwarded to JUnit tag filtering.
+- Class-name filtering is also available via Gradle test selectors where needed
+  (for example, `--tests *CompileTest`), but tag filtering remains the default
+  execution strategy.
+- `-DincludeTags` supports comma-separated literal tags. When you need a single exact tag with special
+  characters, quote the argument for the shell.
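+
+As a mental model for how a comma-separated tag list selects tests, consider the sketch below.
+It assumes the usual JUnit Platform behavior (a test runs if it carries any included tag and
+no excluded tag) and that the project's build forwards the properties unchanged; neither the
+class `TagFilterSketch` nor its methods exist in the project.
+
+```java
+import java.util.Arrays;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/** Hypothetical model of include/exclude tag selection; not the project's build logic. */
+final class TagFilterSketch {
+
+    private TagFilterSketch() {
+        throw new AssertionError("No instances.");
+    }
+
+    /** Splits a comma-separated property value into trimmed, non-empty literal tags. */
+    static Set<String> parseTags(final String property) {
+        if (property == null || property.isBlank()) {
+            return Set.of();
+        }
+        return Arrays.stream(property.split(","))
+                .map(String::trim)
+                .filter(tag -> !tag.isEmpty())
+                .collect(Collectors.toUnmodifiableSet());
+    }
+
+    /** An empty include set means "no include restriction". */
+    static boolean selected(final Set<String> testTags,
+            final Set<String> include, final Set<String> exclude) {
+        final boolean included = include.isEmpty()
+                || testTags.stream().anyMatch(include::contains);
+        final boolean excluded = testTags.stream().anyMatch(exclude::contains);
+        return included && !excluded;
+    }
+
+    public static void main(final String[] arguments) {
+        final Set<String> include = parseTags("regression");
+        final Set<String> exclude = parseTags("slow");
+        System.out.println(selected(Set.of("unit", "regression"), include, exclude));
+        System.out.println(selected(Set.of("unit", "regression", "slow"), include, exclude));
+    }
+}
+```
+
+Under this model, `-DincludeTags=regression -DexcludeTags=slow` keeps regression tests and drops
+any of them that are also tagged `slow`.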