# Lookup Edge Optimization

Compiled trie nodes (`CompiledNode`) use three lookup strategies when resolving child edges:

1. dense array direct lookup,
2. linear scan for very small child counts,
3. binary search over sorted edge labels.

This page explains the dense path, what `maxExpandedIndex` controls, and how to tune it.

## Runtime model of one node

For a node with sorted edge labels `char[] edges`, the implementation can materialize an index-aligned dense table when the labels occupy a compact code-point interval:

```text
span = maxEdge - minEdge
use dense table iff (span <= maxExpandedIndex) and (maxExpandedIndex > 0)
```

When the dense table is active, lookup is constant-time indexing:

```text
denseIndex = requestedEdge - minEdge
return denseChildren[denseIndex] // or null if outside interval
```

When dense lookup is not active (the interval is too wide or the configured `maxExpandedIndex` is `0`), `CompiledNode` still chooses between two fallback strategies:

- **linear scan** for very small child counts (`4` or fewer children),
- **binary search** for larger child counts.

The fallback strategy is therefore selected by child count, not by "distance" between labels. A linear scan is used whenever a node has only a few edges, even if those edges are spread across very distant code points.

### Example: few edges, wide Unicode span

```text
edges = ['a', '中', '你']
edge count = 3
minEdge = 'a' (U+0061)
maxEdge = '你' (U+4F60)
span = 20223
```

- If `maxExpandedIndex = 512`, dense indexing is not used because `span > maxExpandedIndex`.
- Because `edge count = 3` (<= 4), lookup falls back to a tiny linear scan of the three labels.
- This is exactly the case where the threshold helps: it prevents a table covering more than 20,000 code points from being allocated for three children, and the linear scan keeps lookup cheap even though the interval is wide.

The same rule applies to non-Latin scripts: what matters is the interval width in Unicode code points, not the script name. A compact Arabic-range block can still benefit from dense lookups when keys stay in a tight code-point interval.

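
The selection rules from the runtime model can be sketched in a few lines of Java. This is an illustration only, not the actual `CompiledNode` code: the class `EdgeLookupSketch`, its `Strategy` enum, and the `indexOf` method are hypothetical names, and the sketch returns the position of the edge in the sorted label array rather than a child node.

```java
import java.util.Arrays;

/** Illustrative sketch only; this is not the actual CompiledNode code. */
final class EdgeLookupSketch {

    enum Strategy { DENSE, LINEAR_SCAN, BINARY_SEARCH }

    /** Small-node cutoff, taken from the "4 or fewer children" rule above. */
    static final int LINEAR_SCAN_MAX_CHILDREN = 4;

    private final char[] edges;      // sorted edge labels
    private final int[] denseIndex;  // (edge - minEdge) -> position in edges, or -1; null if inactive
    private final Strategy strategy;

    EdgeLookupSketch(final char[] sortedEdges, final int maxExpandedIndex) {
        this.edges = sortedEdges.clone();
        final int span = edges.length == 0 ? 0 : edges[edges.length - 1] - edges[0];
        if (edges.length > 0 && maxExpandedIndex > 0 && span <= maxExpandedIndex) {
            // Compact interval: materialize one slot per label value in [minEdge, maxEdge].
            final int[] table = new int[span + 1];
            Arrays.fill(table, -1);
            for (int i = 0; i < edges.length; i++) {
                table[edges[i] - edges[0]] = i;
            }
            this.denseIndex = table;
            this.strategy = Strategy.DENSE;
        } else {
            // Fallback is chosen by child count, not by how far apart the labels are.
            this.denseIndex = null;
            this.strategy = edges.length <= LINEAR_SCAN_MAX_CHILDREN
                    ? Strategy.LINEAR_SCAN
                    : Strategy.BINARY_SEARCH;
        }
    }

    Strategy strategy() {
        return strategy;
    }

    /** Returns the position of {@code edge} in the sorted labels, or -1 if absent. */
    int indexOf(final char edge) {
        switch (strategy) {
            case DENSE: {
                final int offset = edge - edges[0];
                return offset >= 0 && offset < denseIndex.length ? denseIndex[offset] : -1;
            }
            case LINEAR_SCAN: {
                for (int i = 0; i < edges.length; i++) {
                    if (edges[i] == edge) {
                        return i;
                    }
                }
                return -1;
            }
            default: {
                final int found = Arrays.binarySearch(edges, edge);
                return found >= 0 ? found : -1;
            }
        }
    }

    public static void main(final String[] args) {
        // 'a', U+4E2D, U+4F60: three edges with a very wide span.
        final EdgeLookupSketch wide =
                new EdgeLookupSketch(new char[] {'a', '\u4E2D', '\u4F60'}, 512);
        final EdgeLookupSketch compact = new EdgeLookupSketch("abcde".toCharArray(), 512);
        final EdgeLookupSketch disabled = new EdgeLookupSketch("abcde".toCharArray(), 0);

        System.out.println(wide.strategy());     // LINEAR_SCAN
        System.out.println(compact.strategy());  // DENSE
        System.out.println(disabled.strategy()); // BINARY_SEARCH
    }
}
```

All three strategies return the same answer for the same edge; only the path taken to reach it differs.
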
## Why this is configurable

`maxExpandedIndex` is purely a performance/memory choice:

- higher value:
  - more intervals qualify for dense tables,
  - more child lookups run in constant time,
  - more memory is spent on dense tables in qualifying nodes.
- lower value (or `0`):
  - fewer dense tables are allocated,
  - fewer lookups take the constant-time path,
  - less memory is spent on materialization.

The value never changes lookup semantics. It only changes the shape of the in-memory structure.

## Persistence and loading model

This threshold is **not** stored in `TrieMetadata`.

- The binary format stores only the trie payload and semantic metadata (`reduction`, `traversal`, case/diacritic settings, and stream version).
- `maxExpandedIndex` is chosen when materializing nodes in memory.
- You can therefore keep one persisted artifact and load it with different in-memory trade-offs depending on deployment constraints.

## Default

- `FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX == 512`
- `CompiledNode.DEFAULT_MAX_EXPANDED_INDEX == 512`

These are practical defaults for mixed-language text and Latin-like scripts where edge labels often cluster.

## Tune during build (writable phase)

Use the full `FrequencyTrie.Builder` constructor when you are compiling from source data. The builder threshold is applied while freezing reduced nodes into the immutable form.

```java
import org.egothor.stemmer.CaseProcessingMode;
import org.egothor.stemmer.DiacriticProcessingMode;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.ReductionSettings;
import org.egothor.stemmer.WordTraversalDirection;

final ReductionSettings settings = ReductionSettings.withDefaults(
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

final FrequencyTrie.Builder<String> fastBuilder = new FrequencyTrie.Builder<>(
        String[]::new,
        settings,
        WordTraversalDirection.BACKWARD,
        CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
        DiacriticProcessingMode.AS_IS,
        1024); // prefer lookup speed

// ... put(...) ...
final FrequencyTrie<String> trie = fastBuilder.build();
```

Use `0` or `256` for a lower memory profile while still building larger tries.

```java
final FrequencyTrie.Builder<String> compactBuilder = new FrequencyTrie.Builder<>(
        String[]::new,
        settings,
        WordTraversalDirection.BACKWARD,
        CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
        DiacriticProcessingMode.AS_IS,
        256); // lower memory profile
```

## Tune when loading a binary artifact (runtime phase)

At artifact load time, you can tune the same trade-off independently of persisted metadata.

```java
import java.nio.file.Path;

import org.egothor.stemmer.StemmerPatchTrieLoader;

// Default threshold.
var defaultLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"));

// Larger threshold: more nodes qualify for dense tables.
var fastLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"), 1024);

// Threshold 0: dense tables disabled.
var compactLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"), 0);
```

You can also set the threshold directly with `FrequencyTrie.readFrom(...)` when reading streams:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

import org.egothor.stemmer.FrequencyTrie;

public final class StreamLoadExample {

    private StreamLoadExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        try (InputStream fileInput = Files.newInputStream(Path.of("stemmers", "english.radixor.gz"));
                GZIPInputStream gzip = new GZIPInputStream(fileInput);
                DataInputStream dataInput = new DataInputStream(gzip)) {
            // The last argument is the in-memory maxExpandedIndex threshold.
            final FrequencyTrie<String> compactOnLoad = FrequencyTrie.readFrom(
                    dataInput, String[]::new, input -> input.readUTF(), 256);
        }
    }
}
```

Note: the string codec is intentionally inline in this snippet to keep it self-contained.

## Practical guidance

- Start with the default (`512`) in production and profile before changing it.
- Use `0` when memory is the priority and query throughput is not the bottleneck.
- Use values around `1024` for workloads dominated by compact alphabets and very hot lookups.

Trade-off expectation:

- increasing `maxExpandedIndex` improves lookup speed when edges tend to occupy short spans,
- decreasing it reduces per-node auxiliary memory in dense-span nodes.
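
To get a rough feel for the memory side of this trade-off before profiling, the dense-table footprint can be estimated from edge spans alone. The sketch below is not part of the library: `DenseSlotEstimateSketch`, `denseSlots`, and `sampleNodes` are hypothetical names, the sample nodes are made up for illustration, and it assumes a qualifying node allocates `span + 1` slots, which follows from the `denseIndex = requestedEdge - minEdge` rule but is not confirmed against the real implementation.

```java
import java.util.List;

/** Illustrative estimate only; not part of the library. */
final class DenseSlotEstimateSketch {

    private DenseSlotEstimateSketch() {
    }

    /**
     * Counts dense-table slots over all nodes, assuming each qualifying node
     * allocates span + 1 slots (one per label value from minEdge to maxEdge).
     */
    static long denseSlots(final List<char[]> sortedEdgesPerNode, final int maxExpandedIndex) {
        long slots = 0;
        for (final char[] edges : sortedEdgesPerNode) {
            if (edges.length == 0 || maxExpandedIndex <= 0) {
                continue; // dense tables disabled or nothing to index
            }
            final int span = edges[edges.length - 1] - edges[0];
            if (span <= maxExpandedIndex) {
                slots += span + 1L;
            }
        }
        return slots;
    }

    /** Made-up nodes with spans 4, 136, 416, 1008, and 20223. */
    static List<char[]> sampleNodes() {
        return List.of(
                "abcde".toCharArray(),
                new char[] {'a', '\u00E9'},
                new char[] {'a', '\u0201'},
                new char[] {'a', '\u0451'},
                new char[] {'a', '\u4F60'});
    }

    public static void main(final String[] args) {
        for (final int threshold : new int[] {0, 256, 512, 1024}) {
            System.out.println(threshold + " -> " + denseSlots(sampleNodes(), threshold) + " slots");
        }
        // Prints:
        // 0 -> 0 slots
        // 256 -> 142 slots
        // 512 -> 559 slots
        // 1024 -> 1568 slots
    }
}
```

Note how the two-edge nodes dominate the total: a node with few children but a moderately wide span pays for every empty slot in between, which is why raising the threshold helps most when edges are both numerous and tightly clustered.
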