feat: implement dense-child optimized trie lookup and enterprise test/CI profile hardening

2026-05-16 03:24:07 +02:00
parent 50c3ab3432
commit dadab5514e
44 changed files with 2052 additions and 294 deletions
--- a/docs/lookup-edge-optimization.md
+++ b/docs/lookup-edge-optimization.md
@@ -0,0 +1,193 @@
+# Lookup Edge Optimization
+
+Compiled trie nodes (`CompiledNode`) use three lookup strategies when resolving child edges:
+
+1. dense array direct lookup,
+2. linear scan for very small child counts,
+3. binary search over sorted edge labels.
+
+This page explains the dense path, what `maxExpandedIndex` controls, and how to tune it.
+
+## Runtime model of one node
+
+For a node with sorted edge labels `char[] edges`, the implementation can materialize an
+index-aligned dense table when labels occupy a small compact code-point interval:
+
+```text
+span = maxEdge - minEdge
+use dense table iff (span <= maxExpandedIndex) and (maxExpandedIndex > 0)
+```
+
+When the dense table is active, a lookup is a single constant-time array access:
+
+```text
+denseIndex = requestedEdge - minEdge
+return denseChildren[denseIndex]  // or null if outside interval
+```
+
+When dense lookup is not active (interval is too wide or the configured
+`maxExpandedIndex` is `0`), `CompiledNode` still chooses between two fallback
+strategies:
+
+- **linear scan** for very small child counts (`4` or fewer children),
+- **binary search** for larger child counts.
+
+The fallback strategy is therefore selected by child count, not by how far apart the labels are.
+A linear scan is used whenever a node has only a few edges, even if those edges are
+spread across very distant code points.
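+
+The selection rules above can be sketched in plain Java. This is a minimal illustration under
+stated assumptions, not the actual `CompiledNode` code: the class `EdgeLookupSketch`, its fields,
+and the `LINEAR_SCAN_LIMIT` constant are hypothetical names, children are plain strings, and the
+dense table is assumed to hold `span + 1` slots with `null` marking labels that have no child.
+
+```java
+import java.util.Arrays;
+
+/** Hypothetical sketch of the three lookup strategies; not the library's CompiledNode. */
+public final class EdgeLookupSketch {
+
+    /** Assumed child-count limit for the linear scan ("4 or fewer children"). */
+    private static final int LINEAR_SCAN_LIMIT = 4;
+
+    private final char[] edges;           // sorted edge labels
+    private final String[] children;      // children[i] belongs to edges[i]
+    private final String[] denseChildren; // null when the dense path is inactive
+
+    public EdgeLookupSketch(final char[] edges, final String[] children, final int maxExpandedIndex) {
+        this.edges = edges.clone();
+        this.children = children.clone();
+        String[] dense = null;
+        if (edges.length > 0 && maxExpandedIndex > 0) {
+            final int span = edges[edges.length - 1] - edges[0];
+            if (span <= maxExpandedIndex) {
+                // Dense path: one slot per label in [minEdge, maxEdge]; gaps stay null.
+                dense = new String[span + 1];
+                for (int i = 0; i < edges.length; i++) {
+                    dense[edges[i] - edges[0]] = children[i];
+                }
+            }
+        }
+        this.denseChildren = dense;
+    }
+
+    public boolean usesDenseTable() {
+        return denseChildren != null;
+    }
+
+    public String findChild(final char requestedEdge) {
+        if (denseChildren != null) {
+            // Strategy 1: direct indexing, with a bounds check for labels outside the interval.
+            final int denseIndex = requestedEdge - edges[0];
+            return denseIndex >= 0 && denseIndex < denseChildren.length
+                    ? denseChildren[denseIndex]
+                    : null;
+        }
+        if (edges.length <= LINEAR_SCAN_LIMIT) {
+            // Strategy 2: linear scan over a handful of labels.
+            for (int i = 0; i < edges.length; i++) {
+                if (edges[i] == requestedEdge) {
+                    return children[i];
+                }
+            }
+            return null;
+        }
+        // Strategy 3: binary search over the sorted labels.
+        final int index = Arrays.binarySearch(edges, requestedEdge);
+        return index >= 0 ? children[index] : null;
+    }
+}
+```
+
+Note that the fallback branch never looks at the span: once the dense table is ruled out, only
+the number of edges decides between scanning and searching.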
+
+### Example: few edges, wide Unicode span
+
+```text
+edges = ['a', '中', '你']
+edge count = 3
+minEdge = 'a' (U+0061)
+maxEdge = '你' (U+4F60)
+span = 20223
+```
+
+- If `maxExpandedIndex = 512`, dense indexing is not used because `span > maxExpandedIndex`.
+- Because `edge count = 3` (<= 4), lookup falls back to a tiny linear scan of the
+  three labels.
+- This is exactly the case the child-count rule is meant for: the node still gets a cheap lookup
+  even though its interval is far too wide for a dense table.
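+
+The arithmetic for this example can be checked with a few lines of Java. The class and method
+names here (`WideSpanExample`, `span`, `usesDenseTable`, `usesLinearScan`) are hypothetical and
+only restate the rules from this page; they are not part of the library API.
+
+```java
+/** Hypothetical helper that replays the strategy decision for one node. */
+public final class WideSpanExample {
+
+    private WideSpanExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    /** Distance between the largest and smallest label of a sorted, non-empty edge array. */
+    public static int span(final char[] sortedEdges) {
+        return sortedEdges[sortedEdges.length - 1] - sortedEdges[0];
+    }
+
+    public static boolean usesDenseTable(final char[] sortedEdges, final int maxExpandedIndex) {
+        return maxExpandedIndex > 0 && span(sortedEdges) <= maxExpandedIndex;
+    }
+
+    public static boolean usesLinearScan(final char[] sortedEdges, final int maxExpandedIndex) {
+        // Assumes the "4 or fewer children" limit described on this page.
+        return !usesDenseTable(sortedEdges, maxExpandedIndex) && sortedEdges.length <= 4;
+    }
+
+    public static void main(final String[] arguments) {
+        final char[] edges = {'a', '\u4E2D', '\u4F60'}; // 'a', '中', '你'
+        System.out.println("span = " + span(edges));
+        System.out.println("dense = " + usesDenseTable(edges, 512));
+        System.out.println("linear scan = " + usesLinearScan(edges, 512));
+    }
+}
+```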
+
+Dense lookup is not tied to any particular script: what matters is interval width in Unicode
+code points, not script name. A compact Arabic-range block can still benefit from dense
+lookups when keys stay in a tight code-point interval.
+
+## Why this is configurable
+
+`maxExpandedIndex` is purely a speed/memory trade-off:
+
+- higher value:
+  - more compact intervals qualify for dense tables,
+  - more child lookups take the constant-time path,
+  - more memory for dense tables in qualifying nodes.
+- lower value (or `0`, which disables dense tables):
+  - less dense-table allocation,
+  - fewer nodes take the constant-time path,
+  - lower materialization memory.
+
+The value never changes lookup semantics. It only changes the in-memory structure shape.
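+
+To get a feel for the memory side of this trade-off, the sketch below estimates how many table
+slots a single node would need. It rests on an assumption that this page does not state
+explicitly: a dense table holds one slot per label in `[minEdge, maxEdge]`, so `span + 1` slots
+in total. `DenseTableCost` and its methods are hypothetical names, and the real per-node
+footprint depends on the actual `CompiledNode` layout.
+
+```java
+/** Hypothetical estimate of dense-table size for one node; not a library API. */
+public final class DenseTableCost {
+
+    private DenseTableCost() {
+        throw new AssertionError("No instances.");
+    }
+
+    /** Slots a dense table would need, or 0 when the node does not qualify. */
+    public static int denseSlots(final char[] sortedEdges, final int maxExpandedIndex) {
+        if (sortedEdges.length == 0 || maxExpandedIndex <= 0) {
+            return 0;
+        }
+        final int span = sortedEdges[sortedEdges.length - 1] - sortedEdges[0];
+        return span <= maxExpandedIndex ? span + 1 : 0;
+    }
+
+    /** Slots that would hold no child, i.e. pure overhead under the stated assumption. */
+    public static int emptySlots(final char[] sortedEdges, final int maxExpandedIndex) {
+        final int slots = denseSlots(sortedEdges, maxExpandedIndex);
+        return slots == 0 ? 0 : slots - sortedEdges.length;
+    }
+}
+```
+
+Summing `emptySlots` over the nodes of a representative trie for a few candidate thresholds is
+one way to compare settings before profiling a real workload.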
+
+## Persistence and loading model
+
+This threshold is **not** stored in `TrieMetadata`.
+
+- The binary format stores only trie payload and semantic metadata (`reduction`, `traversal`,
+  case/diacritic settings, and stream version).
+- `maxExpandedIndex` is chosen when materializing nodes in memory.
+- You can therefore keep one persisted artifact and load it with different in-memory
+  trade-offs depending on deployment constraints.
+
+## Default
+
+- `FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX == 512`
+- `CompiledNode.DEFAULT_MAX_EXPANDED_INDEX == 512`
+
+These are practical defaults for mixed-language text and Latin-like scripts where edge labels
+often cluster.
+
+## Tune during build (writable phase)
+
+Use the full `FrequencyTrie.Builder` constructor when you are compiling from source data.
+The builder threshold is applied while freezing reduced nodes into the immutable form.
+
+```java
+import org.egothor.stemmer.CaseProcessingMode;
+import org.egothor.stemmer.DiacriticProcessingMode;
+import org.egothor.stemmer.FrequencyTrie;
+import org.egothor.stemmer.ReductionMode;
+import org.egothor.stemmer.ReductionSettings;
+import org.egothor.stemmer.WordTraversalDirection;
+
+final ReductionSettings settings = ReductionSettings.withDefaults(
+        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+
+final FrequencyTrie.Builder<String> fastBuilder =
+        new FrequencyTrie.Builder<>(String[]::new,
+                settings,
+                WordTraversalDirection.BACKWARD,
+                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
+                DiacriticProcessingMode.AS_IS,
+                1024); // prefer lookup speed
+
+// ... put(...) ...
+final FrequencyTrie<String> trie = fastBuilder.build();
+```
+
+Use a lower value such as `256`, or `0` to disable dense tables entirely, to reduce memory when building large tries.
+
+```java
+final FrequencyTrie.Builder<String> compactBuilder =
+        new FrequencyTrie.Builder<>(String[]::new,
+                settings,
+                WordTraversalDirection.BACKWARD,
+                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
+                DiacriticProcessingMode.AS_IS,
+                256); // lower memory profile
+```
+
+## Tune when loading a binary artifact (runtime phase)
+
+At artifact load time, you can tune the same trade-off independently of persisted metadata.
+
+```java
+import java.nio.file.Path;
+
+import org.egothor.stemmer.StemmerPatchTrieLoader;
+
+var defaultLookup = StemmerPatchTrieLoader.loadBinary(
+        Path.of("stemmers", "english.radixor.gz"));
+
+var fastLookup = StemmerPatchTrieLoader.loadBinary(
+        Path.of("stemmers", "english.radixor.gz"), 1024);
+
+var compactLookup = StemmerPatchTrieLoader.loadBinary(
+        Path.of("stemmers", "english.radixor.gz"), 0);
+```
+
+You can also set the threshold directly with `FrequencyTrie.readFrom(...)` when reading streams:
+
+```java
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.zip.GZIPInputStream;
+
+import org.egothor.stemmer.FrequencyTrie;
+
+public final class StreamLoadExample {
+
+    private StreamLoadExample() {
+        throw new AssertionError("No instances.");
+    }
+
+    public static void main(final String[] arguments) throws IOException {
+        try (InputStream fileInput = Files.newInputStream(Path.of("stemmers", "english.radixor.gz"));
+                GZIPInputStream gzip = new GZIPInputStream(fileInput);
+                DataInputStream dataInput = new DataInputStream(gzip)) {
+            final FrequencyTrie<String> compactOnLoad = FrequencyTrie.readFrom(
+                    dataInput,
+                    String[]::new,
+                    input -> input.readUTF(),
+                    256);
+        }
+    }
+}
+```
+
+Note: the string codec is intentionally inline in this snippet to keep it self-contained.
+
+## Practical guidance
+
+- Start with the default (`512`) in production and profile before changing it.
+- Use `0` when memory is the priority and query throughput is not the bottleneck.
+- Use values around `1024` for workloads dominated by compact alphabets and very hot lookups.
+
+Trade-off expectation:
+
+- increasing `maxExpandedIndex` improves lookup speed when edges tend to occupy short spans,
+- decreasing it reduces per-node auxiliary memory in dense-span nodes.