Radixor/docs/reduction-semantics.md

# Reduction Semantics

This document explains how **Radixor** decides that two subtrees are equivalent, how the different reduction modes work, and how those choices affect observable runtime behavior.

## Why reduction exists

Without reduction, the trie would still work, but many subtrees that mean the same thing would remain duplicated. The result would be a much larger runtime artifact than necessary.

Reduction solves that by merging semantically equivalent subtrees into one canonical representative.

The key idea is simple:

> if two subtrees behave the same way under the semantic contract chosen for compilation, only one physical copy is needed.

## Reduction is semantic, not merely structural

Radixor does not reduce nodes merely because they look similar locally. It reduces subtrees only when their **meaning** matches according to the selected mode.

That is why reduction is based on a **signature** that captures both:

1. the local semantics of the current node,
2. the structure and semantics of all descendant edges.

Conceptually:

```text
Signature = (LocalDescriptor, SortedChildDescriptors)
```

Two subtrees are merged only if their signatures are equal.

## Local descriptors

The local descriptor defines what “equivalent” means for the values stored at one node.

Radixor supports three semantic views.

### Ranked descriptor

The ranked descriptor preserves the full ordered result semantics of `getAll()`.

That means:

- candidate membership is preserved,
- local ordering is preserved,
- observable ranked multi-result behavior remains stable.

This is the most semantically faithful mode when ambiguity handling matters.

### Unordered descriptor

The unordered descriptor preserves the set of reachable results, but not their local ordering.

That means:

- candidate membership is preserved,
- ordering differences may be ignored,
- more subtrees can be merged than in ranked mode.

This mode is useful when alternative candidates matter but exact ranking does not.

### Dominant descriptor

The dominant descriptor focuses on the preferred result returned by `get()`.

This mode is used only when the dominant local candidate is strong enough according to configured thresholds:

- minimum winner percentage,
- winner-over-second ratio.

If that local dominance is not strong enough, Radixor does not force dominant semantics anyway. It falls back to ranked semantics for that node to avoid unsafe over-reduction.

That fallback is one of the most important safeguards in the design.

## Child descriptors

A subtree is not defined only by the values stored at the current node. It is also defined by what behavior is reachable through its children.

Each child contributes:

```text
(edge character, child signature)
```

Children are sorted by edge character so that signatures remain deterministic and stable.

This matters because reduction must not depend on incidental map iteration order or other non-semantic implementation details.

## Canonicalization

Once a subtree signature is computed, the reduction process checks whether an equivalent canonical subtree already exists.

If yes, the existing reduced node is reused.

If no, a new canonical reduced node is created and registered.

This turns reduction into a canonicalization process:

- compute semantic identity,
- find canonical representative,
- reuse or create,
- continue bottom-up.

That is how Radixor eliminates duplicated equivalent subtrees.

## Count aggregation and compiled state

When multiple original build-time subtrees collapse into one canonical reduced node, local counts may be aggregated.

This is an important point for understanding compiled artifacts.

A compiled trie is not always a verbatim replay of original insertion history. It is a canonical runtime structure that preserves the semantics guaranteed by the chosen reduction mode.

This explains two things:

- why compiled artifacts can become dramatically smaller,
- why reconstructing a builder from a compiled trie reflects the compiled state rather than the full original unreduced history.

## Reduction modes

### `MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS`

This mode merges subtrees only when their `getAll()` results are equivalent for every reachable key suffix and when local ordering is preserved.

Use this mode when:

- ambiguity handling matters,
- `getAll()` ordering should remain meaningful,
- behavioral fidelity is more important than maximum compression.

This is the safest and most generally recommended mode.

### `MERGE_SUBTREES_WITH_EQUIVALENT_UNORDERED_GET_ALL_RESULTS`

This mode also preserves `getAll()`-level membership equivalence for every reachable key suffix, but it ignores local ordering differences.

Use this mode when:

- alternative candidates still matter,
- exact ordering is less important,
- stronger reduction is acceptable.

This mode is more aggressive than ranked mode, but less semantically rich.

### `MERGE_SUBTREES_WITH_EQUIVALENT_DOMINANT_GET_RESULTS`

This mode focuses on preserving dominant `get()` semantics for every reachable key suffix, subject to dominance thresholds.

Use this mode when:

- the main operational concern is the preferred result,
- richer alternative-result behavior is less important,
- stronger reduction is desirable.

Because non-dominant nodes fall back to ranked semantics, this mode is not simply “discard everything except the winner”. It is a controlled reduction strategy with a built-in safety condition.

## Practical effect on runtime behavior

Reduction mode is not just a storage optimization setting. It affects what distinctions remain visible after compilation.

### When ranked mode is used

You can rely on full ranked `getAll()` semantics being preserved.

### When unordered mode is used

You can rely on candidate membership, but not necessarily on preserving the same local ranking distinctions.

### When dominant mode is used

You optimize primarily for preferred-result semantics. Alternative-result behavior may still exist, but it is no longer the primary semantic contract of the reduction.

## Choosing a mode

A practical rule of thumb is:

- choose **ranked** if you are unsure,
- choose **unordered** if alternative membership matters but ranking does not,
- choose **dominant** only when your application is fundamentally driven by `get()` and you understand the trade-off.

## Why this design works well

The reduction model succeeds because it does not confuse “smaller” with “acceptable”.

Instead, it makes the semantic contract explicit:

- what exactly must be preserved,
- what differences may be ignored,
- when a more aggressive mode is safe,
- when the system must fall back to a stricter interpretation.

That explicitness is what makes the compression trustworthy.

## Mental model to keep

If you want one concise mental model for reduction, use this one:

- build-time insertion collects examples,
- reduction asks which subtrees mean the same thing,
- the answer depends on the chosen semantic contract,
- canonical representatives are shared,
- the compiled trie preserves the behavior promised by that contract.

## Continue with

- [Architecture](architecture.md)
- [Programmatic usage](programmatic-usage.md)
- [CLI compilation](cli-compilation.md)