2026-04-13 02:10:46 +02:00

Architecture and Reduction

This document describes the internal architecture of Radixor and the principles behind its trie compilation and reduction model.

It explains:

how data flows from dictionary input to compiled trie
how patch-command tries are structured
how subtree reduction works
how reduction modes affect behavior and size

Overview

Radixor transforms dictionary data into an optimized runtime structure through three stages:

Mutable construction
Reduction (canonicalization)
Compilation (freezing)

Dictionary → Mutable trie → Reduced trie → Compiled trie

Each stage has a distinct purpose:

| Stage | Purpose | Structure |
| --- | --- | --- |
| Mutable construction | Collect mappings | `MutableNode` |
| Reduction | Merge equivalent subtrees | `ReducedNode` |
| Compilation | Optimize for runtime lookup | `CompiledNode` |

Core data model

Patch-command trie

Radixor stores patch commands instead of stems directly.

keys: word forms
values: transformation commands
structure: trie (prefix tree)

At runtime:

the word is traversed through the trie
a patch command is retrieved
the patch is applied to reconstruct the stem
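
To make the three runtime steps concrete, here is a minimal sketch. The patch format used below (`"<n>:<suffix>"`, meaning "drop the last n characters, then append the suffix") is an assumption made for illustration only; it is not Radixor's actual patch-command encoding, and a plain map stands in for the trie.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: a toy patch format, not Radixor's real encoding. */
final class PatchCommandSketch {

    /**
     * Applies a hypothetical patch of the form "<n>:<suffix>":
     * drop the last n characters of the word, then append the suffix.
     */
    static String apply(String word, String patch) {
        int colon = patch.indexOf(':');
        int drop = Integer.parseInt(patch.substring(0, colon));
        String suffix = patch.substring(colon + 1);
        if (drop > word.length()) {
            throw new IllegalArgumentException("patch removes more characters than the word has");
        }
        return word.substring(0, word.length() - drop) + suffix;
    }

    public static void main(String[] args) {
        // Stand-in for the trie: word form -> patch command.
        Map<String, String> patches = new LinkedHashMap<>();
        patches.put("running", "4:");   // drop "ning"           -> "run"
        patches.put("studies", "3:y");  // drop "ies", append y  -> "study"
        patches.put("run", "0:");       // no-op: word and stem are identical

        for (Map.Entry<String, String> e : patches.entrySet()) {
            System.out.println(e.getKey() + " -> " + apply(e.getKey(), e.getValue()));
        }
    }
}
```

Storing the patch rather than the stem means that many different word forms can carry the same value, which is what later makes subtree sharing effective.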

Stage 1: Mutable construction

The builder (FrequencyTrie.Builder) constructs a trie using:

MutableNode
maps of children (char → node)
maps of value counts (value → frequency)

Characteristics:

insertion-order preserving
mutable
optimized for building, not querying

Example structure:

g
 └─ n
     └─ i
         └─ n
             └─ n
                 └─ u
                     └─ r
                         └─ (values: {
                               "<patch-command-1>": 3,
                               "<patch-command-2>": 1
                           })

This example represents the word "running", stored in reversed form.

each edge corresponds to one character of the word
the path is traversed from the end of the word toward the beginning
the terminal node stores one or more patch commands together with their local frequencies

The values represent transformations from the word form to candidate stems, and the counts indicate how often each mapping was observed during construction.

Note: Radixor stores word forms in reversed order so that suffix-based transformations can be matched efficiently in a trie.
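
The build-stage structure described above can be sketched roughly as follows. The class and method names (`MutableTrieSketch`, `put`, `counts`) are hypothetical and do not mirror the real `FrequencyTrie.Builder` API; the sketch only shows reversed insertion and per-node value counting.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a mutable build-time trie; names and fields are assumptions. */
final class MutableTrieSketch {

    static final class Node {
        // LinkedHashMap keeps insertion order, matching the build-stage description.
        final Map<Character, Node> children = new LinkedHashMap<>();
        final Map<String, Integer> valueCounts = new LinkedHashMap<>();
    }

    final Node root = new Node();

    /** Records one observation of word -> patch, walking the word back to front. */
    void put(String word, String patch) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.valueCounts.merge(patch, 1, Integer::sum);
    }

    /** Returns the value counts stored for the word, or an empty map if the path is missing. */
    Map<String, Integer> counts(String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0 && node != null; i--) {
            node = node.children.get(word.charAt(i));
        }
        return node == null ? Map.of() : node.valueCounts;
    }

    public static void main(String[] args) {
        MutableTrieSketch trie = new MutableTrieSketch();
        for (int i = 0; i < 3; i++) {
            trie.put("running", "<patch-command-1>");
        }
        trie.put("running", "<patch-command-2>");
        // prints {<patch-command-1>=3, <patch-command-2>=1}
        System.out.println(trie.counts("running"));
    }
}
```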

Local value summary

Before reduction, each node is summarized using LocalValueSummary.

It computes:

ordered values (by frequency)
aligned counts
total frequency
dominant value (if any)
second-best value

This summary is critical for:

deterministic ordering
reduction decisions
dominance evaluation
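
A summary of this kind could look like the sketch below. The field and method names are assumptions rather than the real `LocalValueSummary` members, and the tie-break is simplified to "count, then value text"; the full ordering rules are described under Deterministic ordering.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a per-node summary; the fields are assumptions, not the real class. */
final class LocalSummarySketch {
    final List<String> orderedValues = new ArrayList<>();
    // orderedCounts.get(i) is the count of orderedValues.get(i).
    final List<Integer> orderedCounts = new ArrayList<>();
    final int total;

    LocalSummarySketch(Map<String, Integer> valueCounts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(valueCounts.entrySet());
        // Highest count first; ties fall back to the value text so the order is reproducible.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int sum = 0;
        for (Map.Entry<String, Integer> e : entries) {
            orderedValues.add(e.getKey());
            orderedCounts.add(e.getValue());
            sum += e.getValue();
        }
        total = sum;
    }

    /** Most frequent value, or null for a node without values. */
    String best() {
        return orderedValues.isEmpty() ? null : orderedValues.get(0);
    }

    /** Runner-up value, or null when fewer than two values exist. */
    String secondBest() {
        return orderedValues.size() < 2 ? null : orderedValues.get(1);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("<patch-command-2>", 1);
        counts.put("<patch-command-1>", 3);
        LocalSummarySketch s = new LocalSummarySketch(counts);
        System.out.println(s.orderedValues + " " + s.orderedCounts + " total=" + s.total);
    }
}
```

Whether `best()` actually counts as dominant depends on thresholds that are applied later, during descriptor selection.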

Stage 2: Reduction (canonicalization)

Reduction is the process of merging semantically equivalent subtrees.

Why reduction exists

Without reduction:

trie size grows linearly with input data
repeated patterns are duplicated

With reduction:

identical subtrees are shared
memory footprint is reduced
binary output becomes smaller

Reduction signature

Each subtree is represented by a ReductionSignature.

A signature consists of:

local descriptor (node semantics)
child descriptors (structure)

Signature = (LocalDescriptor, SortedChildDescriptors)

Two subtrees are merged if their signatures are equal.

Local descriptors

The local descriptor encodes how values at a node are interpreted.

Radixor supports three descriptor types:

1. Ranked descriptor

Preserves:

full ordering of values (getAll())

Uses:

ordered value list

Best for:

correctness
deterministic multi-result behavior

2. Unordered descriptor

Preserves:

only membership (set of values)

Ignores:

ordering differences

Best for:

higher compression
use cases where ordering is irrelevant
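
The difference between the ranked and unordered descriptors comes down to which notion of equality is used. The sketch below models that with a list versus a set; the helper names are hypothetical and the real descriptor classes may carry more information.

```java
import java.util.List;
import java.util.Set;

/** Sketch: ranked vs unordered descriptors modelled as list vs set equality. */
final class DescriptorSketch {

    /** Ranked: the ordered list is the descriptor, so order matters. */
    static Object ranked(List<String> orderedValues) {
        return List.copyOf(orderedValues);
    }

    /** Unordered: only membership is kept, so order is ignored. */
    static Object unordered(List<String> orderedValues) {
        return Set.copyOf(orderedValues);
    }

    public static void main(String[] args) {
        List<String> left = List.of("<patch-a>", "<patch-b>");
        List<String> right = List.of("<patch-b>", "<patch-a>");

        System.out.println(ranked(left).equals(ranked(right)));       // false: different ranking
        System.out.println(unordered(left).equals(unordered(right))); // true: same members
    }
}
```

Two nodes that differ only in ranking therefore stay separate under the ranked descriptor but become merge candidates under the unordered one.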

3. Dominant descriptor

Preserves:

only the dominant value (get())

Condition:

the dominant value must satisfy both thresholds:
- a minimum share of the node's total frequency
- a minimum ratio over the second-best value's frequency

Fallback:

if dominance is not strong enough → ranked descriptor is used

Best for:

maximum compression
single-result workflows
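
The dominance condition can be sketched as a small predicate. The parameter names and the example threshold values are hypothetical; the source does not state how Radixor names or configures them.

```java
/** Sketch of the dominance test; thresholds and names are illustrative assumptions. */
final class DominanceSketch {

    /**
     * Returns true when the top count is a large enough share of the total
     * and far enough ahead of the runner-up.
     */
    static boolean isDominant(int topCount, int secondCount, int total,
                              double minShare, double minRatioOverSecond) {
        if (total <= 0 || topCount <= 0) {
            return false;
        }
        boolean shareOk = (double) topCount / total >= minShare;
        // With no runner-up, the ratio condition holds trivially.
        boolean ratioOk = secondCount == 0
                || (double) topCount / secondCount >= minRatioOverSecond;
        return shareOk && ratioOk;
    }

    public static void main(String[] args) {
        // 9 of 10 observations, 9x the runner-up: dominant.
        System.out.println(isDominant(9, 1, 10, 0.75, 3.0)); // true
        // 3 of 5 observations: share too low, so the ranked descriptor would be used instead.
        System.out.println(isDominant(3, 2, 5, 0.75, 3.0));  // false
    }
}
```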

Child descriptors

Each child is represented as:

(edge character, child signature)

Children are sorted by edge character to ensure:

deterministic signatures
stable equality comparisons

Reduction context

ReductionContext maintains:

mapping: ReductionSignature → ReducedNode
canonical instances of subtrees

Workflow:

compute signature
check if already exists
reuse existing node or create new one

This ensures:

structural sharing
no duplicate equivalent subtrees
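
The compute, check, reuse workflow is a form of hash-consing. Below is a minimal sketch under simplifying assumptions: the local descriptor is just the ordered value list (ranked style), a child is described by its edge character plus the id of its canonical node, and all class names are stand-ins rather than the real `ReductionSignature`, `ReductionContext` or `ReducedNode`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of signature-based subtree sharing; all types are simplified stand-ins. */
final class ReductionSketch {

    /** Input node: ordered values plus children keyed by edge character. */
    static final class Node {
        final List<String> values = new ArrayList<>();
        final Map<Character, Node> children = new TreeMap<>(); // iterates in edge order
    }

    /** Canonical node; one instance exists per distinct signature. */
    static final class Reduced {
        final int id;
        final List<String> values;
        final Map<Character, Reduced> children;

        Reduced(int id, List<String> values, Map<Character, Reduced> children) {
            this.id = id;
            this.values = values;
            this.children = children;
        }
    }

    // Signature -> canonical instance.
    private final Map<List<Object>, Reduced> canonical = new HashMap<>();

    Reduced reduce(Node node) {
        Map<Character, Reduced> reducedChildren = new TreeMap<>();
        List<Object> childDescriptors = new ArrayList<>();
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Reduced child = reduce(e.getValue()); // children first (bottom-up)
            reducedChildren.put(e.getKey(), child);
            // A canonical child's id identifies its signature, so (edge, id) describes it.
            childDescriptors.add(e.getKey());
            childDescriptors.add(child.id);
        }
        List<Object> signature = List.of(List.copyOf(node.values), childDescriptors);
        Reduced existing = canonical.get(signature);
        if (existing != null) {
            return existing; // an equivalent subtree was already seen
        }
        Reduced created = new Reduced(canonical.size(), List.copyOf(node.values), reducedChildren);
        canonical.put(signature, created);
        return created;
    }

    int canonicalCount() {
        return canonical.size();
    }

    public static void main(String[] args) {
        Node root = new Node();
        for (char edge : new char[] {'t', 'w'}) {
            Node leaf = new Node();
            leaf.values.add("<patch-command-1>");
            root.children.put(edge, leaf);
        }
        ReductionSketch context = new ReductionSketch();
        Reduced reduced = context.reduce(root);
        // true: both edges now point at the same canonical leaf
        System.out.println(reduced.children.get('t') == reduced.children.get('w'));
    }
}
```

Because children are reduced before their parent, comparing child ids is enough to compare whole subtrees, which keeps each signature small.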

Reduced nodes

ReducedNode represents:

canonical subtree
aggregated value counts
canonical children

It supports:

merging local counts
verifying structural consistency

At this stage:

structure is canonical
still mutable (internally)

Stage 3: Compilation (freezing)

The reduced trie is converted into a CompiledNode structure.

CompiledNode characteristics

immutable
array-based storage
optimized for fast lookup

Fields:

char[] edgeLabels
CompiledNode[] children
V[] orderedValues
int[] orderedCounts

Lookup algorithm

Runtime lookup:

traverse the trie from the root, matching characters from the end of the word toward the beginning
at each node, binary-search edgeLabels for the next character
retrieve the values stored at the node that was reached
apply the selected patch command

Properties:

O(length of word) node visits, each with a binary search over that node's edge labels
low memory overhead
minimal memory allocation during lookup; patch application produces the resulting string
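
The traversal can be sketched with parallel arrays and `Arrays.binarySearch`. This is a simplified illustration: values are plain strings, the node class is a stand-in for `CompiledNode`, and the sketch returns null when no exact path exists, which is an assumption about miss handling rather than documented Radixor behavior.

```java
import java.util.Arrays;

/** Sketch of array-based lookup over a frozen trie; types are stand-ins. */
final class CompiledLookupSketch {

    static final class Node {
        final char[] edgeLabels;      // sorted ascending, required by binary search
        final Node[] children;        // children[i] is reached through edgeLabels[i]
        final String[] orderedValues; // preferred value first

        Node(char[] edgeLabels, Node[] children, String[] orderedValues) {
            this.edgeLabels = edgeLabels;
            this.children = children;
            this.orderedValues = orderedValues;
        }
    }

    static Node leaf(String... values) {
        return new Node(new char[0], new Node[0], values);
    }

    /** Returns the preferred value for the word, or null if there is no exact path or no value. */
    static String get(Node root, String word) {
        Node node = root;
        for (int i = word.length() - 1; i >= 0; i--) { // end of word first
            int idx = Arrays.binarySearch(node.edgeLabels, word.charAt(i));
            if (idx < 0) {
                return null;
            }
            node = node.children[idx];
        }
        return node.orderedValues.length == 0 ? null : node.orderedValues[0];
    }

    /** Tiny trie holding "do" and "go", stored reversed: o -> {d, g}. */
    static Node sample() {
        Node afterO = new Node(new char[] {'d', 'g'},
                new Node[] {leaf("<patch-for-do>"), leaf("<patch-for-go>")},
                new String[0]);
        return new Node(new char[] {'o'}, new Node[] {afterO}, new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(get(sample(), "go")); // <patch-for-go>
        System.out.println(get(sample(), "no")); // null
    }
}
```

The loop allocates nothing; only the later patch application creates a new string.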

Deterministic ordering

Values are ordered by the following criteria, each applied only when the previous ones tie:

higher frequency first
shorter string first
lexicographically smaller first
earlier insertion first

This guarantees:

reproducible builds
stable query results
predictable ranking
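
The ordering rules can be expressed as a chained comparator. This is a sketch that assumes string values and an insertion-ordered count map; it relies on `List.sort` being stable, so any values still tied after the first three criteria keep their insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the tie-break chain; the method name and input shape are assumptions. */
final class ValueOrderingSketch {

    static List<String> order(Map<String, Integer> countsInInsertionOrder) {
        List<String> values = new ArrayList<>(countsInInsertionOrder.keySet());
        Comparator<String> byCountDesc = (a, b) ->
                Integer.compare(countsInInsertionOrder.get(b), countsInInsertionOrder.get(a));
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        Comparator<String> lexical = Comparator.naturalOrder();
        // Stable sort: remaining ties keep the insertion order of the input map.
        values.sort(byCountDesc.thenComparing(byLength).thenComparing(lexical));
        return values;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("ccc", 2);
        counts.put("bb", 2);
        counts.put("aa", 2);
        counts.put("z", 1);
        counts.put("dddd", 5);
        System.out.println(order(counts)); // [dddd, aa, bb, ccc, z]
    }
}
```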

Reduction modes

Reduction modes control how local descriptors are chosen.

Ranked mode

MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS

preserves full semantics
safest option
recommended default

Unordered mode

MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS

ignores ordering
higher compression
weaker semantics: getAll() keeps membership but not necessarily ranking

Dominant mode

MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS

keeps only dominant result
highest compression
may lose alternative candidates

Trade-offs

| Aspect | Ranked | Unordered | Dominant |
| --- | --- | --- | --- |
| Compression | Medium | High | Highest |
| Accuracy | High | Medium | Lower |
| getAll() | Full | Partial | Limited |
| get() | Exact | Exact | Heuristic |

Deserialization model

Binary loading uses:

NodeData as intermediate representation
reconstruction of CompiledNode

This separates:

I/O format
in-memory structure
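
One way to picture the separation is a flat table of I/O records that is turned into linked nodes in a second step. The layout below is purely hypothetical (children referenced by table index, children stored before parents, root last); the source does not describe the real `NodeData` fields or the binary format.

```java
import java.util.List;

/** Sketch of rebuilding linked nodes from flat records; the layout is an assumption. */
final class DeserializationSketch {

    /** I/O-facing record: children are referenced by index into the node table. */
    static final class NodeData {
        final char[] edgeLabels;
        final int[] childIndexes;
        final String[] values;

        NodeData(char[] edgeLabels, int[] childIndexes, String[] values) {
            this.edgeLabels = edgeLabels;
            this.childIndexes = childIndexes;
            this.values = values;
        }
    }

    /** In-memory node: children are direct references. */
    static final class CompiledNode {
        final char[] edgeLabels;
        final CompiledNode[] children;
        final String[] values;

        CompiledNode(char[] edgeLabels, CompiledNode[] children, String[] values) {
            this.edgeLabels = edgeLabels;
            this.children = children;
            this.values = values;
        }
    }

    /** Assumes children appear before their parents, so the last entry is the root. */
    static CompiledNode rebuild(List<NodeData> table) {
        if (table.isEmpty()) {
            throw new IllegalArgumentException("empty node table");
        }
        CompiledNode[] built = new CompiledNode[table.size()];
        for (int i = 0; i < built.length; i++) {
            NodeData data = table.get(i);
            CompiledNode[] children = new CompiledNode[data.childIndexes.length];
            for (int c = 0; c < children.length; c++) {
                int index = data.childIndexes[c];
                if (index < 0 || index >= i) {
                    throw new IllegalStateException("child index " + index + " must precede node " + i);
                }
                children[c] = built[index]; // the same index yields the same shared instance
            }
            built[i] = new CompiledNode(data.edgeLabels, children, data.values);
        }
        return built[built.length - 1];
    }

    public static void main(String[] args) {
        NodeData leaf = new NodeData(new char[0], new int[0], new String[] {"<patch-command-1>"});
        NodeData root = new NodeData(new char[] {'t', 'w'}, new int[] {0, 0}, new String[0]);
        CompiledNode tree = rebuild(List.of(leaf, root));
        System.out.println(tree.children[0] == tree.children[1]); // true: sharing survives loading
    }
}
```

Keeping the record type separate means the on-disk layout can change without touching the lookup structure.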

Why this architecture works

Radixor achieves:

Compactness

subtree sharing
efficient encoding
compressed binary output

Performance

array-based lookup
no runtime reduction
minimal branching

Flexibility

configurable reduction strategies
multiple result support
dictionary-driven behavior

Determinism

stable ordering
canonical signatures
reproducible builds

Design philosophy

The architecture reflects a few key principles:

separate build-time complexity from runtime simplicity
encode semantics explicitly (not implicitly in code)
favor deterministic behavior over heuristic shortcuts
allow controlled trade-offs between size and fidelity

When to tune reduction

You should consider changing reduction mode when:

binary size is too large
memory footprint must be minimized
only single-result stemming is needed

Otherwise:

use ranked mode by default

Next steps

Summary

Radixor’s architecture is built around:

patch-command tries
canonical subtree reduction
immutable compiled structures

This design allows the system to remain:

fast
compact
deterministic
adaptable

while still supporting advanced use cases such as:

ambiguity-aware stemming
dictionary evolution
controlled trade-offs between size and behavior
