2026-04-13 02:10:46 +02:00

Dictionary Format

Radixor uses a simple, line-oriented dictionary format to define mappings between word forms and their canonical stems.

This format is intentionally minimal, language-agnostic, and easy to generate from existing linguistic resources or corpora.

Overview

Each logical line defines:

one canonical stem
zero or more word variants belonging to that stem

stem variant1 variant2 variant3 ...

At compile time:

each variant is converted into a patch command transforming the variant into the stem
the stem itself may optionally be stored as a no-op mapping

Basic example

run running runs ran
connect connected connecting connection
analyze analyzing analysed analyses

This defines:

Stem	Variants
run	running, runs, ran
connect	connected, connecting, connection
analyze	analyzing, analysed, analyses

Syntax rules

1. Tokenization

Tokens are separated by whitespace
Multiple spaces and tabs are treated as a single separator
Leading and trailing whitespace is ignored

2. First token is the stem

The first token on each line is always the canonical stem
All following tokens are treated as variants of that stem

3. Case normalization

All input is normalized to lowercase using Locale.ROOT
Dictionaries should ideally already be lowercase to avoid ambiguity

4. Empty lines

Empty lines are ignored

5. Duplicate variants

Duplicate variants are allowed but have no additional effect
Frequency is determined by occurrence across the entire dataset
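
The syntax rules above can be sketched as a small line parser. This is an illustrative sketch only, not Radixor's own parser: the class `DictionaryLineSketch`, the `Entry` record, and the `parseLine` method are hypothetical names, and remark handling is deliberately left out.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch of the syntax rules; not Radixor's own parser. */
final class DictionaryLineSketch {

    /** One logical line: the canonical stem and its variants. */
    record Entry(String stem, List<String> variants) {}

    /** Parses one line; returns empty for blank lines. Remarks are not handled here. */
    static Optional<Entry> parseLine(String line) {
        // Rules 1 and 3: drop surrounding whitespace, lowercase with Locale.ROOT.
        String normalized = line.strip().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return Optional.empty(); // Rule 4: empty lines are ignored.
        }
        // Rule 1: any run of spaces or tabs counts as a single separator.
        String[] tokens = normalized.split("\\s+");
        // Rule 2: the first token is the stem, the rest are its variants.
        List<String> variants = List.of(tokens).subList(1, tokens.length);
        return Optional.of(new Entry(tokens[0], variants));
    }

    public static void main(String[] args) {
        System.out.println(parseLine("  Run\trunning   RUNS ran "));
        System.out.println(parseLine("   "));
    }
}
```

The sketch lowercases the whole line before splitting, which is equivalent to lowercasing each token because whitespace is unaffected by case conversion.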

Remarks (comments)

The parser supports both full-line and trailing remarks.

Supported remark markers

#
//

Examples

run running runs ran   # English verb forms
connect connected connecting  // basic forms

Everything after the first occurrence of a remark marker is ignored.

Important note

Remark markers cannot be escaped. If # or // appears inside a token, the rest of the line, including the remainder of that token, is treated as a remark.
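
A minimal sketch of this behavior, assuming the parser simply cuts the line at the earliest marker. The class `RemarkSketch` and the method `stripRemark` are hypothetical names used for illustration, not part of Radixor.

```java
/** Hypothetical sketch of remark stripping; not Radixor's own parser. */
final class RemarkSketch {

    /** Returns the part of the line before the first '#' or '//' marker. */
    static String stripRemark(String line) {
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        int cut = line.length();
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        return line.substring(0, cut);
    }

    public static void main(String[] args) {
        // Trailing remark: only the entry itself remains.
        System.out.println("[" + stripRemark("run running runs ran   # English verb forms") + "]");
        // A marker inside a token truncates that token as well.
        System.out.println("[" + stripRemark("c# csharp") + "]");
    }
}
```

In the second call only `c` survives, which shows why tokens containing a marker cannot be represented in this format.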

Storing the original form

When compiling, you may enable:

--store-original

This causes the stem itself to be stored using a no-op patch command.

Example:

run running runs

With --store-original, this implicitly includes:

run -> run

This is useful when:

the input may already be normalized
you want stable identity mappings
you want to avoid missing entries for canonical forms
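
The effect of the option can be sketched as the set of word-to-stem mappings produced for one line. This is an assumption-laden illustration: `StoreOriginalSketch` and `expand` are hypothetical names, and the sketch stops before patch encoding, so it shows stems where the compiled trie would hold patch commands.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of how one entry expands into word -> stem mappings. */
final class StoreOriginalSketch {

    /** Expands an entry; with storeOriginal the stem also maps to itself. */
    static Map<String, String> expand(String stem, List<String> variants, boolean storeOriginal) {
        Map<String, String> mappings = new LinkedHashMap<>();
        if (storeOriginal) {
            // Identity mapping; the real compiler would store this as a no-op patch.
            mappings.put(stem, stem);
        }
        for (String variant : variants) {
            mappings.put(variant, stem);
        }
        return mappings;
    }

    public static void main(String[] args) {
        System.out.println(expand("run", List.of("running", "runs"), false));
        System.out.println(expand("run", List.of("running", "runs"), true));
    }
}
```

Without the option, a lookup for `run` has no entry of its own in this sketch; with it, `run` resolves to itself.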

Frequency and ordering

Radixor tracks local frequencies of values.

Frequency is determined by:

how many times a mapping appears during construction
merging behavior during reduction

When multiple stems exist for a word:

results are ordered by descending frequency
ties are resolved deterministically:
1. shorter textual representation wins
2. lexicographically smaller value wins
3. earlier insertion order wins

This guarantees stable and reproducible results.
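
The ordering rules above map naturally onto a chained comparator. The sketch below is illustrative: the `Candidate` record and its fields are a hypothetical shape, not Radixor's internal representation, and "lexicographically smaller" is assumed to mean `String.compareTo` order.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the documented result ordering. */
final class CandidateOrderSketch {

    /** A candidate value with its frequency and insertion index (assumed shape). */
    record Candidate(String value, int frequency, int insertionIndex) {}

    /** Descending frequency, then shorter value, then lexicographic, then insertion order. */
    static final Comparator<Candidate> ORDER =
            Comparator.comparingInt(Candidate::frequency).reversed()
                    .thenComparingInt(c -> c.value().length())
                    .thenComparing(Candidate::value)
                    .thenComparingInt(Candidate::insertionIndex);

    /** Returns the candidate values in preferred-first order. */
    static List<String> order(List<Candidate> candidates) {
        return candidates.stream().sorted(ORDER).map(Candidate::value).toList();
    }

    public static void main(String[] args) {
        System.out.println(order(List.of(
                new Candidate("axe", 1, 0),
                new Candidate("ax", 1, 1),
                new Candidate("axis", 3, 2))));
    }
}
```

Because every tie-breaker is a total order on its key and insertion indexes are unique, two runs over the same input cannot produce different orderings.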

Ambiguity and multiple stems

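A word may legitimately map to more than one stem. Because the first token on a line is always the stem, list the word as a variant on one line per stem:

ax axes
axe axes

This allows Radixor to represent ambiguity explicitly.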

At runtime:

get(word) returns the preferred result
getAll(word) returns all candidates
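
The difference between the two lookups can be illustrated with a toy in-memory stand-in. This is not Radixor's API: `AmbiguitySketch` is a hypothetical class that stores stems directly instead of patch commands, keeps candidates in insertion order rather than applying the frequency rules, and returns an `Optional` for unknown words as an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Toy stand-in for get/getAll semantics; not Radixor's trie. */
final class AmbiguitySketch {

    private final Map<String, List<String>> stemsByWord = new HashMap<>();

    /** Registers every variant on the line under the line's stem. */
    void addLine(String line) {
        String[] tokens = line.strip().toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            List<String> stems = stemsByWord.computeIfAbsent(tokens[i], k -> new ArrayList<>());
            if (!stems.contains(tokens[0])) {
                stems.add(tokens[0]);
            }
        }
    }

    /** All candidate stems for the word, preferred first. */
    List<String> getAll(String word) {
        return List.copyOf(stemsByWord.getOrDefault(word, List.of()));
    }

    /** The preferred stem, if the word is known. */
    Optional<String> get(String word) {
        return getAll(word).stream().findFirst();
    }

    public static void main(String[] args) {
        AmbiguitySketch sketch = new AmbiguitySketch();
        sketch.addLine("ax axes");
        sketch.addLine("axe axes");
        System.out.println(sketch.get("axes"));
        System.out.println(sketch.getAll("axes"));
    }
}
```

Here `axes` appears as a variant on two lines, so it ends up with two candidate stems, and `get` simply returns the first of them.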

Design guidelines

Keep stems consistent

Use a single canonical form:

run instead of mixing run / running
analyze vs analyse — pick one convention

Avoid noise

Do not include:

typos
extremely rare forms (unless required)
inconsistent normalization

Prefer completeness over clever rules

Radixor is data-driven:

more complete dictionaries → better results
no hidden rule system compensates for missing entries

Handle domain-specific vocabulary

You can extend dictionaries with:

product names
technical terms
organization-specific terminology

Example: minimal dictionary

go goes going went
be is are was were being
have has having had

Example: domain-specific extension

microservice microservices
container containers containerized
kubernetes kubernetes

Common pitfalls

Mixing cases

Run running Runs   ❌

→ normalized to lowercase, but inconsistent input is error-prone

Multiple stems on one line

run running connect   ❌

→ connect becomes a variant of run, which is incorrect

Hidden comments

run running //comment runs   ❌

→ everything after // is ignored, so runs is silently dropped

When to use this format

This format is suitable for:

curated linguistic datasets
exported morphological dictionaries
domain-specific vocabularies
generated (word, stem) pairs from corpora
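
For the last case, generating the format from (word, stem) pairs is mostly a grouping step. The sketch below is one possible approach under stated assumptions: `DictionaryWriterSketch`, `Pair`, and `toLines` are hypothetical names, identity pairs are dropped on the assumption that `--store-original` covers them, and tokens are assumed to contain no whitespace or remark markers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: groups (word, stem) pairs into dictionary lines. */
final class DictionaryWriterSketch {

    /** One observed pair, for example extracted from a corpus. */
    record Pair(String word, String stem) {}

    /** Produces one "stem variant1 variant2 ..." line per stem, sorted for stable output. */
    static List<String> toLines(List<Pair> pairs) {
        Map<String, TreeSet<String>> variantsByStem = new TreeMap<>();
        for (Pair pair : pairs) {
            String stem = pair.stem().toLowerCase(Locale.ROOT);
            String word = pair.word().toLowerCase(Locale.ROOT);
            TreeSet<String> variants = variantsByStem.computeIfAbsent(stem, k -> new TreeSet<>());
            if (!word.equals(stem)) {
                variants.add(word); // the set removes duplicate variants
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<String>> entry : variantsByStem.entrySet()) {
            List<String> tokens = new ArrayList<>();
            tokens.add(entry.getKey());
            tokens.addAll(entry.getValue());
            lines.add(String.join(" ", tokens));
        }
        return lines;
    }

    public static void main(String[] args) {
        toLines(List.of(
                new Pair("running", "run"),
                new Pair("runs", "run"),
                new Pair("Ran", "run"),
                new Pair("connected", "connect"))).forEach(System.out::println);
    }
}
```

Sorting stems and variants is a design choice of the sketch: it keeps generated files stable across runs, which makes them easier to review and diff.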

Summary

Radixor dictionaries are intentionally simple:

one line per stem
whitespace-separated tokens
optional remarks
no embedded rules

This simplicity enables:

easy generation
fast parsing
deterministic behavior
efficient compilation into compact patch-command tries
