feat: implement Compile CLI for building binary stemmer tables from source dictionaries feat: add loading support for persisted compiled tries, including GZip-compressed binaries feat: add a builder path for recreating a writable trie from a compiled trie feat: expose read-only value/count access for compiled trie entries feat: support deterministic NOOP patch encoding for identical source and target words fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers fix: preserve valid alternative reductions during trie optimization and reduction fix: correct patch command edge cases discovered in round-trip and malformed-input tests fix: address persistence and compiled-trie handling defects found during implementation review fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs refactor: reorganize trie-related support types into dedicated packages and classes refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture refactor: improve compiled/read-only trie boundaries without restoring mutability refactor: clean up internal reduction, serialization, and helper structure test: add professional JUnit coverage for stemmer core classes test: split trie tests into dedicated test classes per production type test: improve parameterized tests for readability, diagnostics, and edge-case traceability test: cover positive, negative, malformed, persistence, and round-trip scenarios test: verify compiled dictionaries against source inputs using getAll semantics docs: write public README and supplementary Markdown documentation for project publishing docs: document architecture, reduction model, built-in languages, and operational guidance docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation docs: improve examples and wording for professional reader-facing project guidance chore: align project materials with the practical Radix scope and Egothor/Stempel lineage chore: raise overall project quality through documentation review and test hardening
5.1 KiB
Dictionary Format
← Back to README.md
Radixor uses a simple, line-oriented dictionary format to define mappings between word forms and their canonical stems.
This format is intentionally minimal, language-agnostic, and easy to generate from existing linguistic resources or corpora.
Overview
Each logical line defines:
- one canonical stem
- zero or more word variants belonging to that stem
stem variant1 variant2 variant3 ...
At compile time:
- each variant is converted into a patch command transforming the variant into the stem
- the stem itself may optionally be stored as a no-op mapping
Basic example
run running runs ran
connect connected connecting connection
analyze analyzing analysed analyses
This defines:
| Stem | Variants |
|---|---|
| run | running, runs, ran |
| connect | connected, connecting, connection |
| analyze | analyzing, analysed, analyses |
Syntax rules
1. Tokenization
- Tokens are separated by whitespace
- Multiple spaces and tabs are treated as a single separator
- Leading and trailing whitespace is ignored
2. First token is the stem
- The first token on each line is always the canonical stem
- All following tokens are treated as variants of that stem
3. Case normalization
- All input is normalized to lowercase using
Locale.ROOT - Dictionaries should ideally already be lowercase to avoid ambiguity
4. Empty lines
- Empty lines are ignored
5. Duplicate variants
- Duplicate variants are allowed but have no additional effect
- Frequency is determined by occurrence across the entire dataset
Remarks (comments)
The parser supports both full-line and trailing remarks.
Supported remark markers
#//
Examples
run running runs ran # English verb forms
connect connected connecting // basic forms
Everything after the first occurrence of a remark marker is ignored.
Important note
Remark markers are not escaped. If # or // appear in a token, they will terminate the line.
Storing the original form
When compiling, you may enable:
--store-original
This causes the stem itself to be stored using a no-op patch command.
Example:
run running runs
With --store-original, this implicitly includes:
run -> run
This is useful when:
- the input may already be normalized
- you want stable identity mappings
- you want to avoid missing entries for canonical forms
Frequency and ordering
Radixor tracks local frequencies of values.
Frequency is determined by:
- how many times a mapping appears during construction
- merging behavior during reduction
When multiple stems exist for a word:
- results are ordered by descending frequency
- ties are resolved deterministically:
- shorter textual representation wins
- lexicographically smaller value wins
- earlier insertion order wins
This guarantees stable and reproducible results.
Ambiguity and multiple stems
A word may legitimately map to more than one stem:
axes ax axe
This allows Radixor to represent ambiguity explicitly.
At runtime:
get(word)returns the preferred resultgetAll(word)returns all candidates
Design guidelines
Keep stems consistent
Use a single canonical form:
runinstead of mixingrun/runninganalyzevsanalyse— pick one convention
Avoid noise
Do not include:
- typos
- extremely rare forms (unless required)
- inconsistent normalization
Prefer completeness over clever rules
Radixor is data-driven:
- more complete dictionaries → better results
- no hidden rule system compensates for missing entries
Handle domain-specific vocabulary
You can extend dictionaries with:
- product names
- technical terms
- organization-specific terminology
Example: minimal dictionary
go goes going went
be is are was were being
have has having had
Example: domain-specific extension
microservice microservices
container containers containerized
kubernetes kubernetes
Common pitfalls
Mixing cases
Run running Runs ❌
→ normalized to lowercase, but inconsistent input is error-prone
Multiple stems on one line
run running connect ❌
→ connect becomes a variant of run, which is incorrect
Hidden comments
run running //comment runs ❌
→ everything after // is ignored
When to use this format
This format is suitable for:
- curated linguistic datasets
- exported morphological dictionaries
- domain-specific vocabularies
- generated
(word, stem)pairs from corpora
Next steps
Summary
Radixor dictionaries are intentionally simple:
- one line per stem
- whitespace-separated tokens
- optional remarks
- no embedded rules
This simplicity enables:
- easy generation
- fast parsing
- deterministic behavior
- efficient compilation into compact patch-command tries