Lookup Edge Optimization

Compiled trie nodes (CompiledNode) use three lookup strategies when resolving child edges:

  1. dense array direct lookup,
  2. linear scan for very small child counts,
  3. binary search over sorted edge labels.

This page explains the dense path, what maxExpandedIndex controls, and how to tune it.

Runtime model of one node

For a node with sorted edge labels char[] edges, the implementation can materialize an index-aligned dense table when labels occupy a small compact code-point interval:

span = maxEdge - minEdge
use dense table iff (span <= maxExpandedIndex) and (maxExpandedIndex > 0)

When dense lookup is used, lookup is constant-time indexing:

denseIndex = requestedEdge - minEdge
return denseChildren[denseIndex]  // or null if outside interval
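The dense path can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the actual CompiledNode code: the class DenseLookupSketch, its fields, and the use of String as the child payload are hypothetical, and allocating span + 1 slots is an assumption about how such a table would be laid out.

```java
final class DenseLookupSketch {

    private final char minEdge;
    private final String[] denseChildren; // null when no dense table was built

    DenseLookupSketch(final char[] sortedEdges, final String[] children,
            final int maxExpandedIndex) {
        String[] table = null;
        char min = 0;
        if (sortedEdges.length > 0) {
            min = sortedEdges[0];
            final int span = sortedEdges[sortedEdges.length - 1] - min;
            // Same rule as above: threshold must be positive and cover the span.
            if (maxExpandedIndex > 0 && span <= maxExpandedIndex) {
                table = new String[span + 1]; // assumed layout: one slot per code unit
                for (int i = 0; i < sortedEdges.length; i++) {
                    table[sortedEdges[i] - min] = children[i];
                }
            }
        }
        this.minEdge = min;
        this.denseChildren = table;
    }

    boolean isDense() {
        return denseChildren != null;
    }

    /** Constant-time lookup; returns null for labels without a child. */
    String denseLookup(final char requestedEdge) {
        if (denseChildren == null) {
            throw new IllegalStateException("dense table not materialized");
        }
        final int denseIndex = requestedEdge - minEdge;
        if (denseIndex < 0 || denseIndex >= denseChildren.length) {
            return null; // outside the interval
        }
        return denseChildren[denseIndex]; // may be null for a gap inside the interval
    }
}
```

Gaps inside the interval cost one unused slot each, which is why the span, not the child count, drives the memory cost of this path.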

When dense lookup is not active (interval is too wide or the configured maxExpandedIndex is 0), CompiledNode still chooses between two fallback strategies:

  • linear scan for very small child counts (4 or fewer children),
  • binary search for larger child counts.

The fallback strategy is therefore selected by child count, not by span. A node with only a few edges uses the linear scan even when those edges lie at very distant code points.
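A minimal sketch of the two fallback strategies, assuming the cutoff of 4 children stated above. The class FallbackLookupSketch, the constant LINEAR_SCAN_LIMIT, and the method findEdgeIndex are hypothetical names used only for illustration; the real CompiledNode may be organized differently.

```java
import java.util.Arrays;

final class FallbackLookupSketch {

    /** Child counts at or below this limit use the linear scan (assumed from the text). */
    static final int LINEAR_SCAN_LIMIT = 4;

    private FallbackLookupSketch() {
    }

    /** Returns the position of requestedEdge in sortedEdges, or -1 when absent. */
    static int findEdgeIndex(final char[] sortedEdges, final char requestedEdge) {
        if (sortedEdges.length <= LINEAR_SCAN_LIMIT) {
            // Few children: a short scan avoids the overhead of binary search.
            for (int i = 0; i < sortedEdges.length; i++) {
                if (sortedEdges[i] == requestedEdge) {
                    return i;
                }
            }
            return -1;
        }
        // Larger child counts: binary search over the sorted labels.
        final int index = Arrays.binarySearch(sortedEdges, requestedEdge);
        return index >= 0 ? index : -1;
    }
}
```

Both branches return the same result for the same input; only the cost profile differs.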

Example: few edges, wide Unicode span

edges = ['a', '中', '你']
edge count = 3
minEdge = 'a' (U+0061)
maxEdge = '你' (U+4F60)
span = 0x4F60 - 0x0061 = 20223
  • If maxExpandedIndex = 512, dense indexing is not used because span > maxExpandedIndex.
  • Because edge count = 3 (<= 4), lookup falls back to a tiny linear scan of the three labels.
  • This is the case the child-count rule exists for: the interval is far too wide for a dense table, but scanning three labels is still cheap.
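The decision for this example can be reproduced with a small sketch that combines the dense rule and the child-count rule. The class StrategySketch and its Strategy enum are hypothetical, and checking the dense rule before the fallback rule is an assumption based on the description above.

```java
final class StrategySketch {

    enum Strategy { DENSE, LINEAR_SCAN, BINARY_SEARCH }

    private StrategySketch() {
    }

    /** Picks a lookup strategy for one node from its sorted edge labels. */
    static Strategy choose(final char[] sortedEdges, final int maxExpandedIndex) {
        if (sortedEdges.length > 0 && maxExpandedIndex > 0) {
            final int span = sortedEdges[sortedEdges.length - 1] - sortedEdges[0];
            if (span <= maxExpandedIndex) {
                return Strategy.DENSE;
            }
        }
        // Fallback depends on child count only, not on the span.
        return sortedEdges.length <= 4 ? Strategy.LINEAR_SCAN : Strategy.BINARY_SEARCH;
    }

    public static void main(final String[] arguments) {
        final char[] edges = {'a', '\u4E2D', '\u4F60'}; // 'a', '中', '你'
        System.out.println(choose(edges, 512)); // prints LINEAR_SCAN
    }
}
```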

Dense lookup is not limited to Latin text: what matters is the width of the interval in Unicode code points, not the script. A block of Arabic letters, for example, still qualifies for dense lookup when a node's edge labels stay within a tight code-point interval.

Why this is configurable

maxExpandedIndex is purely a performance/memory trade-off:

  • higher value:
    • more compact intervals qualify for dense tables,
    • more constant-time child lookup,
    • more memory for dense tables in qualifying nodes.
  • lower value (or 0):
    • less dense-table allocation,
    • fewer nodes take the constant-time path,
    • lower materialization memory.

The value never changes lookup semantics. It only changes the in-memory structure shape.
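To get a rough feel for the memory side of the trade-off, the following sketch counts how many dense-table slots a set of nodes would need under a given threshold. It assumes each qualifying node allocates span + 1 slots, which is an illustrative model rather than a statement about the real allocation; DenseCostSketch and denseSlots are hypothetical names.

```java
final class DenseCostSketch {

    private DenseCostSketch() {
    }

    /** Sums the assumed dense-table slots over all nodes that qualify. */
    static long denseSlots(final char[][] nodesSortedEdges, final int maxExpandedIndex) {
        long slots = 0;
        for (final char[] edges : nodesSortedEdges) {
            if (edges.length == 0 || maxExpandedIndex <= 0) {
                continue; // nothing to expand, or dense tables disabled
            }
            final int span = edges[edges.length - 1] - edges[0];
            if (span <= maxExpandedIndex) {
                slots += span + 1L;
            }
        }
        return slots;
    }
}
```

Running such an estimate over the edge arrays of a real trie with a few candidate thresholds shows how quickly slot counts grow once wide-span nodes start to qualify.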

Persistence and loading model

This threshold is not stored in TrieMetadata.

  • The binary format stores only trie payload and semantic metadata (reduction, traversal, case/diacritic settings, and stream version).
  • maxExpandedIndex is chosen when materializing nodes in memory.
  • You can therefore keep one persisted artifact and load it with different in-memory trade-offs depending on deployment constraints.

Default

  • FrequencyTrie.DEFAULT_MAX_EXPANDED_INDEX == 512
  • CompiledNode.DEFAULT_MAX_EXPANDED_INDEX == 512

These are practical defaults for mixed-language text and Latin-like scripts where edge labels often cluster.

Tune during build (writable phase)

Use the full FrequencyTrie.Builder constructor when you are compiling from source data. The builder threshold is applied while freezing reduced nodes into the immutable form.

import org.egothor.stemmer.CaseProcessingMode;
import org.egothor.stemmer.DiacriticProcessingMode;
import org.egothor.stemmer.FrequencyTrie;
import org.egothor.stemmer.ReductionMode;
import org.egothor.stemmer.ReductionSettings;
import org.egothor.stemmer.WordTraversalDirection;

final ReductionSettings settings = ReductionSettings.withDefaults(
        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

final FrequencyTrie.Builder<String> fastBuilder =
        new FrequencyTrie.Builder<>(String[]::new,
                settings,
                WordTraversalDirection.BACKWARD,
                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
                DiacriticProcessingMode.AS_IS,
                1024); // prefer lookup speed

// ... put(...) ...
final FrequencyTrie<String> trie = fastBuilder.build();

Use 256 or 0 to reduce memory, which matters most when building large tries; 0 disables dense tables entirely.

final FrequencyTrie.Builder<String> compactBuilder =
        new FrequencyTrie.Builder<>(String[]::new,
                settings,
                WordTraversalDirection.BACKWARD,
                CaseProcessingMode.LOWERCASE_WITH_LOCALE_ROOT,
                DiacriticProcessingMode.AS_IS,
                256); // lower memory profile

Tune when loading a binary artifact (runtime phase)

At artifact load time, you can tune the same trade-off independently of persisted metadata.

import java.nio.file.Path;

import org.egothor.stemmer.StemmerPatchTrieLoader;

var defaultLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"));

var fastLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"), 1024);

var compactLookup = StemmerPatchTrieLoader.loadBinary(
        Path.of("stemmers", "english.radixor.gz"), 0);

You can also set the threshold directly with FrequencyTrie.readFrom(...) when reading streams:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

import org.egothor.stemmer.FrequencyTrie;

public final class StreamLoadExample {

    private StreamLoadExample() {
        throw new AssertionError("No instances.");
    }

    public static void main(final String[] arguments) throws IOException {
        try (InputStream fileInput = Files.newInputStream(Path.of("stemmers", "english.radixor.gz"));
                GZIPInputStream gzip = new GZIPInputStream(fileInput);
                DataInputStream dataInput = new DataInputStream(gzip)) {
            final FrequencyTrie<String> compactOnLoad = FrequencyTrie.readFrom(
                    dataInput,
                    String[]::new,
                    input -> input.readUTF(),
                    256);
        }
    }
}

Note: the string codec is intentionally inline in this snippet to keep it self-contained.

Practical guidance

  • Start with the default (512) in production and profile before changing it.
  • Use 0 when memory is the priority and query throughput is not the bottleneck.
  • Use values around 1024 for workloads dominated by compact alphabets and very hot lookups.

Trade-off expectation:

  • increasing maxExpandedIndex improves lookup speed when edges tend to occupy short spans,
  • decreasing it reduces per-node auxiliary memory in dense-span nodes.