# Architecture

This document explains the structural architecture of **Radixor**: what data is stored, how it flows through the build pipeline, and how runtime lookup works once a compiled trie has been produced.

## The central idea

Radixor does not store final stems directly as a large flat lookup table. Instead, it stores **patch commands** that describe how a word form should be transformed into a canonical stem.

For example, if a dictionary states that `running` should reduce to `run`, the runtime artifact does not need to store the full output string `run` under the key `running`. It can instead store a compact transformation command that expresses how to turn the source form into the target form.

That matters because many words share similar transformation patterns. Once those mappings are organized in a trie and reduced to a canonical structure, the result is much smaller than a naive direct-output table, because shared structure is reused rather than repeated for every word.

## End-to-end build flow

The full build-time flow is:

```text
Dictionary -> Mutable trie -> Reduced trie -> Compiled trie
```

Each stage has a different purpose.

### Dictionary input

The textual dictionary groups known word forms under a canonical stem:

```text
run running runs ran
connect connected connecting connection
```

The first token is the canonical stem. The following tokens are known variants.
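
As an illustration of this line format, the following sketch splits one dictionary line into a stem and its variants. The `DictionaryEntry` class is hypothetical and not part of Radixor. Whitespace-separated tokens and lowercasing with `Locale.ROOT` are assumptions based on the example above and on the lowercased parsing mentioned later in this document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical holder for one parsed dictionary line; not a Radixor type. */
final class DictionaryEntry {
    final String stem;
    final List<String> variants;

    DictionaryEntry(String stem, List<String> variants) {
        this.stem = stem;
        this.variants = variants;
    }

    /** Parses "stem variant1 variant2 ..." into a stem and its variants. */
    static DictionaryEntry parse(String line) {
        // A fixed locale keeps lowercasing independent of the platform default.
        String[] tokens = line.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            throw new IllegalArgumentException("empty dictionary line");
        }
        List<String> variants = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            variants.add(tokens[i]);
        }
        return new DictionaryEntry(tokens[0], variants);
    }
}
```

With this sketch, `DictionaryEntry.parse("run running runs ran")` yields the stem `run` and the variants `running`, `runs`, and `ran`.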

### Patch-command generation

Each variant is converted into a patch command that transforms the variant into the stem.

Conceptually:

```text
running -> <patch> -> run
runs    -> <patch> -> run
ran     -> <patch> -> run
```

If `storeOriginal` is enabled, the stem itself is also inserted using a canonical no-op patch.
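
To make the idea concrete, here is a minimal sketch of one possible patch scheme: drop a number of trailing characters, then append a suffix. This is an illustrative stand-in, not Radixor's actual patch-command encoding. The `SuffixPatch` class and its `<charsToDrop>:<suffixToAppend>` format are assumptions made for this example only.

```java
/** Illustrative suffix-rewrite patch; NOT Radixor's actual patch-command encoding. */
final class SuffixPatch {
    private SuffixPatch() {}

    /** Builds a patch "<charsToDrop>:<suffixToAppend>" that turns variant into stem. */
    static String encode(String variant, String stem) {
        int common = 0;
        int limit = Math.min(variant.length(), stem.length());
        // Measure the shared prefix; everything after it must be rewritten.
        while (common < limit && variant.charAt(common) == stem.charAt(common)) {
            common++;
        }
        return (variant.length() - common) + ":" + stem.substring(common);
    }

    /** Applies a patch produced by encode to a word at least as long as the drop count. */
    static String apply(String word, String patch) {
        int sep = patch.indexOf(':');
        int drop = Integer.parseInt(patch.substring(0, sep));
        return word.substring(0, word.length() - drop) + patch.substring(sep + 1);
    }
}
```

Under this scheme, `connecting` and `connection` both map to `connect` through the same patch (`3:`), which is the kind of sharing that makes patch commands more compact than full output strings. A stem mapped to itself produces the no-op patch `0:`.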

### Mutable trie construction

Those patch-command values are inserted into a mutable trie keyed by the source surface form.

### Reduction

Equivalent subtrees are merged into canonical reduced nodes.
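
One common way to merge equivalent subtrees is to compute a canonical signature for each subtree bottom-up and keep a single representative per signature. The sketch below illustrates that general technique. It is not Radixor's actual reduction algorithm: the `SubtreeReducer` and `Node` types are hypothetical, and including the counts in the signature corresponds to only one possible reduction mode.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical reduction by subtree signature; not Radixor's actual algorithm. */
final class SubtreeReducer {
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>(); // sorted for a canonical order
        final TreeMap<String, Integer> values = new TreeMap<>();   // value -> local count
    }

    private final Map<List<Object>, Node> canonical = new HashMap<>();
    private final Map<Node, Integer> ids = new HashMap<>(); // identity-based: Node has no equals()

    /** Returns the canonical representative of the subtree, rewiring children bottom-up. */
    Node reduce(Node node) {
        List<Object> signature = new ArrayList<>();
        signature.add(new TreeMap<>(node.values)); // assumption: counts are part of identity
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            Node child = reduce(edge.getValue());
            edge.setValue(child); // point at the shared representative
            signature.add(edge.getKey());
            signature.add(ids.get(child));
        }
        Node existing = canonical.putIfAbsent(signature, node);
        if (existing != null) {
            return existing; // an equivalent subtree was already seen
        }
        ids.put(node, ids.size());
        return node;
    }

    int canonicalNodeCount() {
        return canonical.size();
    }
}
```

Because children are reduced before their parent, two parents with equal values and equal canonical children produce the same signature and collapse into one node.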

### Compilation

The reduced structure is frozen into an immutable compiled trie optimized for runtime lookup.

## Why a trie is used

A trie is useful because many word forms share structural fragments. Instead of storing each word independently, the trie reuses paths and organizes lookup by character traversal.

A trie node can contain:

- outgoing edges,
- one or more ordered values,
- counts aligned with those values.

This is why the structure can represent both:

- a single preferred result,
- multiple competing results for the same key.

## Stage 1: Mutable construction

The mutable build-time structure is created by `FrequencyTrie.Builder`.

This stage is optimized for insertion rather than runtime lookup. As dictionary data is added, the builder accumulates:

- child edges,
- local values,
- local frequencies of those values.

Those frequencies are not incidental metadata. They later influence both result ordering and, depending on reduction mode, the semantic identity of subtrees during reduction.
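
The accumulation described above can be sketched with a small insertion-oriented node. `MutableNode` is a hypothetical type for illustration; the real internals of `FrequencyTrie.Builder` may differ, and forward character-by-character keying is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical insertion-oriented node; FrequencyTrie.Builder's real internals may differ. */
final class MutableNode {
    final Map<Character, MutableNode> children = new TreeMap<>();
    // LinkedHashMap remembers the order in which values were first seen.
    final Map<String, Integer> valueCounts = new LinkedHashMap<>();

    /** Inserts value under key and increments its local frequency at the addressed node. */
    void put(String key, String value) {
        MutableNode node = this;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new MutableNode());
        }
        node.valueCounts.merge(value, 1, Integer::sum);
    }

    /** Returns the node addressed by key, or null if the path does not exist. */
    MutableNode find(String key) {
        MutableNode node = this;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.children.get(key.charAt(i));
        }
        return node;
    }
}
```

Inserting the same value twice under one key leaves a single entry with a count of 2, which is the local frequency that later drives ordering.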

### Why the build-time form is mutable

The builder must make it easy to add new keys and to accumulate counts for existing ones. A runtime lookup structure has the opposite needs.

Build-time priorities are:

- flexibility,
- accumulation of counts,
- structural growth.

Runtime priorities are:

- compactness,
- immutability,
- fast lookup.

Radixor therefore keeps construction and runtime representation strictly separate.

## What a compiled node contains

After reduction and freezing, the runtime structure uses immutable compiled nodes.

A compiled node stores:

- `char[] edgeLabels`
- child-node references aligned with those labels
- ordered value arrays
- aligned count arrays

This array-based form is compact and efficient for lookup.
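
A minimal sketch of such a node is shown below. Only the name `edgeLabels` comes from the list above. The class name, the other field names, the assumption that labels are sorted, and the use of binary search are illustrative choices, not confirmed details of Radixor.

```java
import java.util.Arrays;

/** Hypothetical immutable node; only the name edgeLabels comes from the text above. */
final class CompiledNodeSketch {
    final char[] edgeLabels;             // assumed sorted ascending
    final CompiledNodeSketch[] children; // children[i] is reached via edgeLabels[i]
    final String[] values;               // ranked, preferred value first
    final int[] counts;                  // counts[i] is the frequency of values[i]

    CompiledNodeSketch(char[] edgeLabels, CompiledNodeSketch[] children,
                       String[] values, int[] counts) {
        this.edgeLabels = edgeLabels;
        this.children = children;
        this.values = values;
        this.counts = counts;
    }

    /** Finds the child for an edge label by binary search, or null if absent. */
    CompiledNodeSketch child(char label) {
        int index = Arrays.binarySearch(edgeLabels, label);
        return index >= 0 ? children[index] : null;
    }

    /** Walks the key character by character; returns null if the path is missing. */
    CompiledNodeSketch find(String key) {
        CompiledNodeSketch node = this;
        for (int i = 0; i < key.length() && node != null; i++) {
            node = node.child(key.charAt(i));
        }
        return node;
    }
}
```

Parallel arrays avoid per-edge map objects, and a sorted label array allows child lookup without hashing.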

## Runtime lookup model

At runtime, lookup is conceptually simple:

1. traverse the compiled trie by the input key,
2. reach the node addressed by that key,
3. retrieve one or more stored patch commands,
4. apply the chosen patch command to the original word.

The trie itself does not create the final stem string. It selects the stored transformation command. `PatchCommandEncoder.apply(...)` then performs the actual transformation.

That separation is architecturally important:

- the trie is responsible for **selection**,
- patch application is responsible for **transformation**.
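
The split between selection and transformation can be sketched with two pluggable functions. `LookupPipelineSketch` is a hypothetical class, not a Radixor API. The selector stands in for the compiled trie, the transformer stands in for patch application, and returning an unknown word unchanged is an assumption.

```java
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Hypothetical pipeline that keeps selection and transformation separate. */
final class LookupPipelineSketch {
    private final Function<String, String> selector;  // word -> patch command, or null
    private final BinaryOperator<String> transformer; // (word, patch) -> stem

    LookupPipelineSketch(Function<String, String> selector,
                         BinaryOperator<String> transformer) {
        this.selector = selector;
        this.transformer = transformer;
    }

    /** Selects a patch for the word, then applies it to the original word. */
    String stem(String word) {
        String patch = selector.apply(word);
        // Assumption: a word with no stored patch is returned unchanged.
        return patch == null ? word : transformer.apply(word, patch);
    }
}
```

Because the two roles only meet in `stem`, either side can be replaced or tested on its own.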

## `get()` and `getAll()`

The runtime API exposes two complementary views of the addressed node.

### `get()`

`get()` returns the locally preferred value stored at that node.

Preference is deterministic:

1. higher local frequency wins,
2. shorter textual representation wins,
3. lexicographically lower textual representation wins,
4. stable first-seen order acts as the final tie-breaker.
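
The four rules above can be expressed as a comparator chain followed by a stable sort. The `ValueRanking` and `Candidate` types are hypothetical, and comparing values with `String` ordering is an assumption about what "textual representation" means here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical ranking of (value, count) pairs following the four rules above. */
final class ValueRanking {
    static final class Candidate {
        final String value;
        final int count;

        Candidate(String value, int count) {
            this.value = value;
            this.count = count;
        }
    }

    static final Comparator<Candidate> PREFERENCE =
        Comparator.comparingInt((Candidate c) -> c.count).reversed() // 1. higher frequency first
            .thenComparingInt(c -> c.value.length())                 // 2. shorter text first
            .thenComparing(c -> c.value);                            // 3. lexicographically lower first

    /** Returns candidates best-first; the input must be in first-seen order. */
    static List<Candidate> rank(List<Candidate> firstSeenOrder) {
        List<Candidate> ranked = new ArrayList<>(firstSeenOrder);
        // 4. List.sort is stable, so candidates that tie on rules 1-3 keep first-seen order.
        ranked.sort(PREFERENCE);
        return ranked;
    }
}
```

In this sketch, the preferred value is simply the first element of the ranked list.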

### `getAll()`

`getAll()` returns all locally stored values in deterministic ranked order.

This is what allows Radixor to preserve ambiguity explicitly instead of forcing every key into a single answer.

## Why multiple results can exist

Some stemming systems discard ambiguity early because they insist on returning exactly one answer.

Radixor does not require that simplification. If multiple plausible patch commands exist for a key, the compiled trie can preserve them and the runtime API can expose them.

That is useful when downstream logic wants to:

- inspect ambiguity,
- preserve alternatives for retrieval,
- apply later ranking or domain-specific selection.

## Why compiled artifacts are compact

The final compiled trie can be much smaller than the original dictionary for several reasons working together:

- patch commands are compact,
- trie paths reuse shared structure,
- reduction merges equivalent subtrees,
- binary persistence stores the already reduced form,
- gzip compression is applied on top of the binary format.

This is why a very large dictionary can still produce a manageable deployable runtime artifact.
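
The layering of compression over a binary format can be sketched with the standard `java.util.zip` streams. The record layout below (a count followed by value and frequency pairs) is invented for illustration and is not Radixor's actual persistence format. `GzipPersistenceSketch` is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical binary layout wrapped in gzip; not Radixor's actual file format. */
final class GzipPersistenceSketch {
    private GzipPersistenceSketch() {}

    /** Writes a count followed by (value, frequency) pairs through a gzip stream. */
    static byte[] write(String[] values, int[] counts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(values.length);
            for (int i = 0; i < values.length; i++) {
                out.writeUTF(values[i]);
                out.writeInt(counts[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // closing the stream above finished the gzip trailer
    }

    /** Reads the values back, skipping the frequencies. */
    static String[] readValues(byte[] data) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            String[] values = new String[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readUTF();
                in.readInt(); // frequency, unused in this sketch
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The binary writer knows nothing about compression, so the format and the compression layer can evolve independently.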

## Why preparation can still use more memory

The compactness of the final artifact should not be confused with the memory usage of preparation.

Before reduction has completed, the mutable build-time structure must exist in memory. For large dictionaries, that temporary preparation cost can be noticeably higher than the size of the final persisted artifact or the loaded compiled trie.

That is why the preferred operational model is usually:

- compile offline,
- persist the compiled artifact,
- load the finished artifact in runtime services.

## Determinism as a design principle

Radixor favors deterministic behavior throughout the pipeline.

This appears in:

- lowercased dictionary parsing,
- stable value ordering,
- sorted child descriptors,
- canonical reduction signatures,
- reproducible compiled lookup behavior.

Determinism matters not only for tests, but also for operational trust. It makes stemming behavior explainable and reproducible across builds and environments.

## Continue with

- [Reduction Semantics](reduction-semantics.md)
- [Programmatic usage](programmatic-usage.md)
- [CLI compilation](cli-compilation.md)