Refine stemmer core, compiled trie workflow, tests, and public documentation

feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words

fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs

refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure

test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics

docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance

chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00
parent 15248c92c9
commit 038514bad0
64 changed files with 190190 additions and 20 deletions

docs/dictionary-format.md Normal file

@@ -0,0 +1,255 @@
# Dictionary Format
> ← Back to [README.md](../README.md)
Radixor uses a simple, line-oriented dictionary format to define mappings between **word forms** and their **canonical stems**.
This format is intentionally minimal, language-agnostic, and easy to generate from existing linguistic resources or corpora.
## Overview
Each logical line defines:
- one **canonical stem**
- zero or more **word variants** belonging to that stem
```
stem variant1 variant2 variant3 ...
```
At compile time:
- each variant is converted into a **patch command** transforming the variant into the stem
- the stem itself may optionally be stored as a **no-op mapping**
## Basic example
```
run running runs ran
connect connected connecting connection
analyze analyzing analysed analyses
```
This defines:
| Stem | Variants |
|----------|----------------------------------------|
| run | running, runs, ran |
| connect | connected, connecting, connection |
| analyze | analyzing, analysed, analyses |
## Syntax rules
### 1. Tokenization
- Tokens are separated by **whitespace**
- Multiple spaces and tabs are treated as a single separator
- Leading and trailing whitespace is ignored
### 2. First token is the stem
- The **first token** on each line is always the canonical stem
- All following tokens are treated as variants of that stem
### 3. Case normalization
- All input is normalized to **lowercase using `Locale.ROOT`**
- Dictionaries should ideally already be lowercase to avoid ambiguity
### 4. Empty lines
- Empty lines are ignored
### 5. Duplicate variants
- Duplicate variants are allowed but have no additional effect
- Frequency is determined by occurrence across the entire dataset
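
The syntax rules above can be sketched in a few lines of Java. This is only an illustration of rules 1 to 4, not Radixor's actual parser: the class `DictionaryLineSketch`, the record `Entry`, and the method `parse` are hypothetical names, and remark handling is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch of the syntax rules; not Radixor's real parser. */
final class DictionaryLineSketch {

    /** One parsed line: the stem plus its variants in source order. */
    record Entry(String stem, List<String> variants) {}

    /** Parses one line; returns an empty Optional for empty lines (rule 4). */
    static Optional<Entry> parse(String line) {
        // Rule 1: leading and trailing whitespace is ignored.
        String trimmed = line.strip();
        if (trimmed.isEmpty()) {
            return Optional.empty();
        }
        // Rule 3: lowercase with Locale.ROOT. Rule 1: split on whitespace runs.
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        // Rule 2: the first token is the stem, the rest are variants.
        List<String> variants = List.of(Arrays.copyOfRange(tokens, 1, tokens.length));
        return Optional.of(new Entry(tokens[0], variants));
    }

    public static void main(String[] args) {
        System.out.println(parse("  Run\trunning   RUNS "));
    }
}
```

Returning an `Optional` keeps the "empty lines are ignored" rule visible at the call site instead of hiding it behind a `null`.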
## Remarks (comments)
The parser supports both full-line and trailing remarks.
### Supported remark markers
- `#`
- `//`
### Examples
```
run running runs ran # English verb forms
connect connected connecting // basic forms
```
Everything after the first occurrence of a remark marker is ignored.
### Important note
Remark markers cannot be escaped. If `#` or `//` appears inside a token, the token is cut at the marker and the rest of the line is ignored.
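
A minimal sketch of this behavior, assuming the parser simply cuts the line at the earliest marker. `RemarkStripperSketch` and `stripRemark` are hypothetical names used for illustration only.

```java
/** Hypothetical sketch of remark stripping; not Radixor's real parser. */
final class RemarkStripperSketch {

    /** Returns the part of the line before the first '#' or '//' marker. */
    static String stripRemark(String line) {
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        int cut = line.length();
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        // No escaping exists, so a marker inside a token also cuts the line.
        return line.substring(0, cut);
    }

    public static void main(String[] args) {
        System.out.println(stripRemark("run running runs ran # English verb forms"));
        System.out.println(stripRemark("c# csharp"));
    }
}
```

The second call in `main` shows the consequence of the missing escape mechanism: everything from the `#` onward is dropped, including the rest of the token.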
## Storing the original form
When compiling, you may enable:
```
--store-original
```
This causes the stem itself to be stored using a **no-op patch command**.
Example:
```
run running runs
```
With `--store-original`, this implicitly includes:
```
run -> run
```
This is useful when:
- the input may already be normalized
- you want stable identity mappings
- you want to avoid missing entries for canonical forms
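
The effect of the flag can be illustrated with a small sketch. It maps words directly to stems for readability, whereas the compiled trie stores patch commands; `StoreOriginalSketch` and `mappings` are hypothetical names, not part of Radixor.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of --store-original; uses stems instead of patch commands. */
final class StoreOriginalSketch {

    /** Maps every variant to the stem; adds stem -> stem when storeOriginal is set. */
    static Map<String, String> mappings(String stem, List<String> variants, boolean storeOriginal) {
        Map<String, String> result = new LinkedHashMap<>();
        if (storeOriginal) {
            // Identity mapping, which corresponds to the no-op patch command.
            result.put(stem, stem);
        }
        for (String variant : variants) {
            result.put(variant, stem);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mappings("run", List.of("running", "runs"), false));
        System.out.println(mappings("run", List.of("running", "runs"), true));
    }
}
```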
## Frequency and ordering
Radixor tracks **local frequencies** of values.
Frequency is determined by:
- how many times a mapping appears during construction
- merging behavior during reduction
When multiple stems exist for a word:
- results are ordered by **descending frequency**
- ties are resolved deterministically:
  1. shorter textual representation wins
  2. lexicographically smaller value wins
  3. earlier insertion order wins
This guarantees **stable and reproducible results**.
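
The ordering described above maps naturally onto a chained `Comparator`. The sketch below is an assumption about how such an ordering could be expressed, not Radixor's implementation: the `Candidate` record and its fields are hypothetical, and "lexicographically smaller" is taken to mean `String.compareTo` order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the documented ordering; not Radixor's real code. */
final class CandidateOrderSketch {

    /** A candidate value with its frequency and the index at which it was first inserted. */
    record Candidate(String value, int frequency, int insertionIndex) {}

    static final Comparator<Candidate> ORDER =
            Comparator.comparingInt(Candidate::frequency).reversed() // descending frequency
                    .thenComparingInt(c -> c.value().length())       // 1. shorter wins
                    .thenComparing(Candidate::value)                 // 2. lexicographically smaller wins
                    .thenComparingInt(Candidate::insertionIndex);    // 3. earlier insertion wins

    /** Returns the candidate values sorted by the documented rules. */
    static List<String> sortedValues(List<Candidate> candidates) {
        List<Candidate> copy = new ArrayList<>(candidates);
        copy.sort(ORDER);
        return copy.stream().map(Candidate::value).toList();
    }

    public static void main(String[] args) {
        System.out.println(sortedValues(List.of(
                new Candidate("axe", 2, 0),
                new Candidate("ax", 2, 1),
                new Candidate("axes", 5, 2))));
    }
}
```

Because every tie-breaker is a total order over its key, the result does not depend on the order in which candidates reach the sort, which is the property the text calls stable and reproducible.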
## Ambiguity and multiple stems
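A word may legitimately map to more than one stem. Because the first token on a line is always the stem, list the ambiguous word as a variant under each candidate stem:
```
ax axes
axe axes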
```
This allows Radixor to represent ambiguity explicitly.
At runtime:
- `get(word)` returns the **preferred result**
- `getAll(word)` returns **all candidates**
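
The difference between the two lookups can be modeled with a plain in-memory map. This is a simplified sketch under stated assumptions: it stores stems rather than patch commands, it ranks candidates by insertion order only, and `AmbiguityLookupSketch` with its `addLine`, `get`, and `getAll` methods is a hypothetical stand-in, not the compiled trie's API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory model of get/getAll; not the compiled trie. */
final class AmbiguityLookupSketch {

    private final Map<String, List<String>> stemsByWord = new LinkedHashMap<>();

    /** Registers one dictionary line: the first token is the stem, the rest are variants. */
    void addLine(String line) {
        String[] tokens = line.strip().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            List<String> stems = stemsByWord.computeIfAbsent(tokens[i], k -> new ArrayList<>());
            if (!stems.contains(tokens[0])) {
                stems.add(tokens[0]);
            }
        }
    }

    /** Returns every candidate stem for the word, or an empty list if it is unknown. */
    List<String> getAll(String word) {
        return List.copyOf(stemsByWord.getOrDefault(word, List.of()));
    }

    /** Returns the preferred candidate, here simply the first one registered. */
    Optional<String> get(String word) {
        return getAll(word).stream().findFirst();
    }

    public static void main(String[] args) {
        AmbiguityLookupSketch index = new AmbiguityLookupSketch();
        index.addLine("ax axes");
        index.addLine("axe axes");
        System.out.println(index.get("axes") + " " + index.getAll("axes"));
    }
}
```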
## Design guidelines
### Keep stems consistent
Use a single canonical form:
- `run` instead of mixing `run` / `running`
- `analyze` or `analyse`: pick one spelling convention
### Avoid noise
Do not include:
- typos
- extremely rare forms (unless required)
- inconsistent normalization
### Prefer completeness over clever rules
Radixor is data-driven:
- more complete dictionaries → better results
- no hidden rule system compensates for missing entries
### Handle domain-specific vocabulary
You can extend dictionaries with:
- product names
- technical terms
- organization-specific terminology
## Example: minimal dictionary
```
go goes going went
be is are was were being
have has having had
```
## Example: domain-specific extension
```
microservice microservices
container containers containerized
kubernetes kubernetes
```
## Common pitfalls
### Mixing cases
```
Run running Runs ❌
```
→ normalized to lowercase, but inconsistent input is error-prone
### Multiple stems on one line
```
run running connect ❌
```
→ `connect` becomes a variant of `run`, which is incorrect
### Hidden comments
```
run running //comment runs ❌
```
→ everything after `//` is ignored
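
Some of these pitfalls can be caught mechanically before compilation. The following lint pass is a hypothetical helper, not part of Radixor; it assumes remarks have already been removed and flags two things only: lines containing uppercase characters, and variants that are also declared as a stem on another line.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-compilation lint; assumes remark-free input lines. */
final class DictionaryLintSketch {

    /** Returns human-readable warnings for the given dictionary lines. */
    static List<String> lint(List<String> lines) {
        // First pass: collect every stem (the first token of each non-empty line).
        Set<String> stems = new HashSet<>();
        for (String line : lines) {
            String first = line.strip().split("\\s+")[0];
            if (!first.isEmpty()) {
                stems.add(first.toLowerCase(Locale.ROOT));
            }
        }
        // Second pass: report suspicious lines.
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            String prefix = "line " + (i + 1) + ": ";
            if (!line.equals(line.toLowerCase(Locale.ROOT))) {
                warnings.add(prefix + "contains uppercase characters");
            }
            String[] tokens = line.strip().toLowerCase(Locale.ROOT).split("\\s+");
            for (int t = 1; t < tokens.length; t++) {
                if (!tokens[t].equals(tokens[0]) && stems.contains(tokens[t])) {
                    warnings.add(prefix + "variant '" + tokens[t] + "' is also a stem on another line");
                }
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        lint(List.of("Run running Runs", "run running connect", "connect connected"))
                .forEach(System.out::println);
    }
}
```

A variant that doubles as a stem is not always a mistake, so the sketch reports it as a warning for review rather than rejecting the line.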
## When to use this format
This format is suitable for:
- curated linguistic datasets
- exported morphological dictionaries
- domain-specific vocabularies
- generated `(word, stem)` pairs from corpora
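
For the last case, `(word, stem)` pairs can be grouped into dictionary lines with a short helper. This is a sketch under assumptions: `PairsToDictionarySketch` and `toLines` are hypothetical names, each pair is modeled as a `Map.Entry` with the word as key and the stem as value, and duplicates are collapsed, so any frequency information carried by repeated pairs is lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical generator: groups (word, stem) pairs into dictionary lines. */
final class PairsToDictionarySketch {

    /** Each entry is word -> stem; the result has one "stem variant..." line per stem. */
    static List<String> toLines(List<Map.Entry<String, String>> pairs) {
        // Sorted collections keep the generated file deterministic.
        Map<String, TreeSet<String>> variantsByStem = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            String word = pair.getKey();
            String stem = pair.getValue();
            TreeSet<String> variants = variantsByStem.computeIfAbsent(stem, k -> new TreeSet<>());
            if (!word.equals(stem)) {
                variants.add(word);
            }
        }
        List<String> lines = new ArrayList<>();
        variantsByStem.forEach((stem, variants) ->
                lines.add(variants.isEmpty() ? stem : stem + " " + String.join(" ", variants)));
        return lines;
    }

    public static void main(String[] args) {
        toLines(List.of(
                Map.entry("running", "run"),
                Map.entry("runs", "run"),
                Map.entry("went", "go"))).forEach(System.out::println);
    }
}
```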
## Next steps
- [CLI compilation](cli-compilation.md)
- [Programmatic usage](programmatic-usage.md)
- [Quick start](quick-start.md)
## Summary
Radixor dictionaries are intentionally simple:
- one line per stem
- whitespace-separated tokens
- optional remarks
- no embedded rules
This simplicity enables:
- easy generation
- fast parsing
- deterministic behavior
- efficient compilation into compact patch-command tries