# Dictionary Format

Radixor uses a simple, line-oriented dictionary format to define mappings between **word forms** and their **canonical stems**.

This format is intentionally minimal, language-agnostic, and easy to generate from existing linguistic resources or corpora.

## Overview

Each logical line defines:

- one **canonical stem**
- zero or more **word variants** belonging to that stem

```
stem variant1 variant2 variant3 ...
```

At compile time:

- each variant is converted into a **patch command** transforming the variant into the stem
- the stem itself may optionally be stored as a **no-op mapping**

## Basic example

```
run running runs ran
connect connected connecting connection
analyze analyzing analysed analyses
```

This defines:

| Stem     | Variants                               |
|----------|----------------------------------------|
| run      | running, runs, ran                     |
| connect  | connected, connecting, connection      |
| analyze  | analyzing, analysed, analyses          |

## Syntax rules

### 1. Tokenization

- Tokens are separated by **whitespace**
- Multiple spaces and tabs are treated as a single separator
- Leading and trailing whitespace is ignored

### 2. First token is the stem

- The **first token** on each line is always the canonical stem
- All following tokens are treated as variants of that stem

### 3. Case normalization

- All input is normalized to **lowercase using `Locale.ROOT`**
- Dictionaries should ideally already be lowercase to avoid ambiguity

### 4. Empty lines

- Empty lines are ignored

### 5. Duplicate variants

- Duplicate variants are allowed but have no additional effect
- Frequency is determined by occurrence across the entire dataset

## Remarks (comments)

The parser supports both full-line and trailing remarks.

### Supported remark markers

- `#`
- `//`

### Examples

```
run running runs ran # English verb forms
connect connected connecting // basic forms
```

Everything after the first occurrence of a remark marker is ignored.

### Important note

Remark markers are not escaped.
If `#` or `//` appears inside a token, the token is cut at that point and the rest of the line is ignored.

## Storing the original form

When compiling, you may enable:

```
--store-original
```

This causes the stem itself to be stored using a **no-op patch command**.

Example:

```
run running runs
```

With `--store-original`, this implicitly includes:

```
run -> run
```

This is useful when:

- the input may already be normalized
- you want stable identity mappings
- you want to avoid missing entries for canonical forms

## Frequency and ordering

Radixor tracks **local frequencies** of values.

Frequency is determined by:

- how many times a mapping appears during construction
- merging behavior during reduction

When multiple stems exist for a word:

- results are ordered by **descending frequency**
- ties are resolved deterministically:
    1. shorter textual representation wins
    2. lexicographically smaller value wins
    3. earlier insertion order wins

This guarantees **stable and reproducible results**.

## Ambiguity and multiple stems

A word may legitimately map to more than one stem. Because the first token on a line is always the stem, the ambiguous word is listed as a variant on one line per stem:

```
ax axes
axe axes
```

This allows Radixor to represent ambiguity explicitly.
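
The ordering rules from "Frequency and ordering" can be illustrated with a comparator. This is only a sketch: the `Candidate` class, its fields, and the `order` helper are hypothetical names introduced here, candidate values are modelled as plain strings, and the frequencies are made up for the example. Radixor's internal representation may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: none of these names are part of Radixor. */
final class CandidateOrderingSketch {

    /** Hypothetical candidate: a value, its frequency, and when it was inserted. */
    static final class Candidate {
        final String value;
        final int frequency;
        final int insertionIndex;

        Candidate(String value, int frequency, int insertionIndex) {
            this.value = value;
            this.frequency = frequency;
            this.insertionIndex = insertionIndex;
        }
    }

    /** Descending frequency, then shorter text, then lexicographic order, then insertion order. */
    static final Comparator<Candidate> ORDER =
            Comparator.comparingInt((Candidate c) -> c.frequency).reversed()
                    .thenComparingInt((Candidate c) -> c.value.length())
                    .thenComparing((Candidate c) -> c.value)
                    .thenComparingInt((Candidate c) -> c.insertionIndex);

    /** Returns the candidate values in preferred-first order without modifying the input. */
    static List<String> order(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(ORDER);
        List<String> values = new ArrayList<>();
        for (Candidate c : sorted) {
            values.add(c.value);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("axe", 2, 0));  // assumed frequency
        candidates.add(new Candidate("ax", 2, 1));   // same frequency, shorter text
        candidates.add(new Candidate("axis", 5, 2)); // highest assumed frequency
        System.out.println(order(candidates)); // prints [axis, ax, axe]
    }
}
```

In this sketch the first element of the ordered list plays the role of the preferred result, and the full list corresponds to all candidates.
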

At runtime:

- `get(word)` returns the **preferred result**
- `getAll(word)` returns **all candidates**

## Design guidelines

### Keep stems consistent

Use a single canonical form:

- `run` instead of mixing `run` / `running`
- `analyze` vs `analyse`: pick one convention

### Avoid noise

Do not include:

- typos
- extremely rare forms (unless required)
- inconsistent normalization

### Prefer completeness over clever rules

Radixor is data-driven:

- more complete dictionaries → better results
- no hidden rule system compensates for missing entries

### Handle domain-specific vocabulary

You can extend dictionaries with:

- product names
- technical terms
- organization-specific terminology

## Example: minimal dictionary

```
go goes going went
be is are was were being
have has having had
```

## Example: domain-specific extension

```
microservice microservices
container containers containerized
kubernetes kubernetes
```

## Common pitfalls

### Mixing cases

```
Run running Runs ❌
```

→ normalized to lowercase, but inconsistent input is error-prone

### Multiple stems on one line

```
run running connect ❌
```

→ `connect` becomes a variant of `run`, which is incorrect

### Hidden comments

```
run running //comment runs ❌
```

→ everything after `//` is ignored

## When to use this format

This format is suitable for:

- curated linguistic datasets
- exported morphological dictionaries
- domain-specific vocabularies
- generated `(word, stem)` pairs from corpora

## Next steps

- [CLI compilation](cli-compilation.md)
- [Programmatic usage](programmatic-usage.md)
- [Quick start](quick-start.md)

## Summary

Radixor dictionaries are intentionally simple:

- one line per stem
- whitespace-separated tokens
- optional remarks
- no embedded rules

This simplicity enables:

- easy generation
- fast parsing
- deterministic behavior
- efficient compilation into compact patch-command tries
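
To make the syntax rules concrete, the following sketch reads a single dictionary line the way the rules describe: cut at the first remark marker, trim, lowercase with `Locale.ROOT`, and split on runs of whitespace. It is not Radixor's parser. The class `DictionaryLineSketch` and its `tokenize` method are hypothetical names, and the exact definition of whitespace is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: a hypothetical reader for one dictionary line, not Radixor's parser. */
final class DictionaryLineSketch {

    /** Returns the lowercase tokens of a line with remarks removed; empty for blank lines. */
    static List<String> tokenize(String line) {
        // Cut at the first remark marker, whichever of '#' or "//" comes first.
        int cut = line.length();
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }

        // Ignore leading and trailing whitespace and normalize case.
        String content = line.substring(0, cut).trim().toLowerCase(Locale.ROOT);

        List<String> tokens = new ArrayList<>();
        if (content.isEmpty()) {
            return tokens; // empty line or full-line remark
        }
        // Runs of spaces and tabs count as a single separator.
        for (String token : content.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Run  running\tRuns   # mixed case and mixed separators",
            "",
            "connect connected connecting // basic forms"
        };
        for (String line : lines) {
            List<String> tokens = tokenize(line);
            if (tokens.isEmpty()) {
                continue; // nothing to record for this line
            }
            String stem = tokens.get(0);                               // first token is the stem
            List<String> variants = tokens.subList(1, tokens.size()); // the rest are variants
            System.out.println(stem + " <- " + variants);
        }
        // prints:
        // run <- [running, runs]
        // connect <- [connected, connecting]
    }
}
```

Because remark markers are not escaped, a line such as `c# csharp` yields only the token `c` in this sketch, which mirrors the behavior described under "Remarks (comments)".
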