docs: sync and improvements

This commit is contained in:
2026-04-26 18:55:25 +02:00
parent 48f21cab72
commit 5a511374f3
13 changed files with 130 additions and 21 deletions

View File

@@ -8,7 +8,7 @@ This is the preferred preparation workflow when stemming should run against an a
The `Compile` tool performs the following steps:
1. reads the input dictionary in the standard Radixor stemmer format, accepting either plain UTF-8 text or GZip-compressed UTF-8 text,
2. parses each line into a canonical stem column and its known variant columns,
3. converts variants into patch commands,
4. builds a mutable trie of patch-command values,
@@ -50,7 +50,7 @@ The CLI supports the following arguments:
Path to the source dictionary file.
The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The input may be plain UTF-8 text or GZip-compressed UTF-8 text; compression is detected from the stream header rather than the file extension. The parser processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.
Example:
@@ -110,7 +110,7 @@ This option is intended for right-to-left languages where affix behavior should
### `--case-processing-mode <mode>`
Controls dictionary key normalization during compilation and lookup. The setting is stored in persisted trie metadata and is therefore available to runtime lookup after binary loading.
Supported values are:
@@ -205,7 +205,7 @@ The CLI is best used as a preparation step during packaging, deployment, or cont
A `.radixor.gz` file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds.
Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes format version, traversal direction, RTL indicator, reduction mode, dominant thresholds, diacritic-processing mode, and case-processing mode. After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains. The current CLI uses `DiacriticProcessingMode.AS_IS`; custom diacritic stripping is available through the programmatic builder and loader APIs rather than through a CLI flag.
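After decompression, the metadata block can be inspected with a few lines of stdlib Java. The sketch below only demonstrates generic `key=value` parsing; the key names and values in it are placeholders, not the artifact's actual keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of reading a decompressed metadata block of key=value lines into a
 * map. The key names used in main() are placeholders; consult a real
 * artifact for the actual keys it persists.
 */
public final class MetadataBlockSketch {
    static Map<String, String> parse(String block) {
        Map<String, String> meta = new LinkedHashMap<>();
        for (String line : block.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                meta.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // Placeholder block; real artifacts define their own key names.
        String block = "formatVersion=1\ncaseProcessingMode=LOWERCASE_WITH_LOCALE_ROOT";
        System.out.println(parse(block).get("caseProcessingMode")); // LOWERCASE_WITH_LOCALE_ROOT
    }
}
```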
### Choose reduction mode deliberately

View File

@@ -127,15 +127,21 @@ is processed the same way as:
run running runs ran
```
## Character set, compression, and normalization
Dictionary files are read as UTF-8 text. Files loaded through `StemmerPatchTrieLoader.load(Path, ...)` may be either plain UTF-8 text or GZip-compressed UTF-8 text; the loader detects GZip input from the stream header instead of relying on the file extension. Bundled dictionaries are stored as GZip resources and are decoded as UTF-8 after decompression.
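Header-based detection can be sketched with the JDK alone: peek at the first two bytes of the stream and compare them with the GZip magic number. The helper name `openPossiblyGzipped` is illustrative and not part of the Radixor API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * Sketch of header-based GZip detection: inspect the leading two bytes
 * instead of trusting the file extension. Illustrative only.
 */
public final class GzipSniffSketch {

    /** Returns a stream that transparently decompresses when the header says GZip. */
    static InputStream openPossiblyGzipped(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = in.readNBytes(head, 0, 2);
        in.unread(head, 0, n); // push the peeked bytes back for the real reader
        boolean gzip = n == 2
            && ((head[0] & 0xFF) | ((head[1] & 0xFF) << 8)) == GZIPInputStream.GZIP_MAGIC;
        return gzip ? new GZIPInputStream(in) : in;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("run\trunning\truns".getBytes(StandardCharsets.UTF_8));
        }
        // Both the compressed and the plain form decode to the same text.
        for (byte[] data : new byte[][] {
                buf.toByteArray(),
                "run\trunning\truns".getBytes(StandardCharsets.UTF_8)}) {
            try (InputStream in = openPossiblyGzipped(new ByteArrayInputStream(data))) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }
}
```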
The parser and trie are not restricted to ASCII. Dictionary items are ordinary Java `String` values, and trie traversal works over Java `char` sequences. This supports Latin-script data with diacritics, Cyrillic data, Hebrew, Persian, Yiddish, and other scripts represented in UTF-8, subject to the normal Java `String` model and the project's traversal configuration.
Case normalization is controlled by `CaseProcessingMode`. The default `LOWERCASE_WITH_LOCALE_ROOT` mode lowercases the line before columns are split into dictionary items. `AS_IS` preserves the original casing.
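A locale-neutral lowercase is worth illustrating, because the JVM's default locale can silently change keys. The snippet below uses plain `java.lang.String` behavior to show why a root-locale lowercase is the stable choice; the library's `LOWERCASE_WITH_LOCALE_ROOT` internals may differ in detail.

```java
import java.util.Locale;

/** Why lowercasing with Locale.ROOT matters: the default locale can change keys. */
public final class LocaleLowercaseSketch {
    public static void main(String[] args) {
        // Locale-neutral lowercasing, stable across JVM default locales.
        System.out.println("RUNNING".toLowerCase(Locale.ROOT)); // running

        // Under a Turkish locale, 'I' lowercases to dotless 'ı' (U+0131),
        // which would silently change dictionary keys.
        System.out.println("I".toLowerCase(Locale.forLanguageTag("tr"))); // ı
        System.out.println("I".toLowerCase(Locale.ROOT));                 // i
    }
}
```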
Diacritic normalization is controlled at trie-build and lookup time by `DiacriticProcessingMode`:
- `AS_IS` preserves dictionary and lookup keys exactly after case handling,
- `REMOVE` strips supported diacritics and common Latin ligatures on both insertion and lookup paths,
- `AS_IS_AND_STRIPPED_FALLBACK` is declared in the public model but is not implemented yet and raises `UnsupportedOperationException`.
For reliable production behavior, choose one normalization policy deliberately and apply it consistently. Normalized ASCII dictionaries remain a practical convention for some legacy stemming data, but they are not a format requirement.
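As a rough stdlib approximation of a `REMOVE`-style policy (the library's own mapping may cover a different character set), canonical decomposition plus combining-mark removal handles most Latin diacritics, while ligatures need an explicit table because NFD does not decompose them:

```java
import java.text.Normalizer;
import java.util.Map;

/**
 * Approximation of a REMOVE-style policy using the JDK: NFD decomposition
 * plus combining-mark removal, with a small explicit table for Latin
 * ligatures. Illustrative only; not the library's actual mapping.
 */
public final class DiacriticStripSketch {
    private static final Map<String, String> LIGATURES =
        Map.of("æ", "ae", "Æ", "AE", "œ", "oe", "Œ", "OE", "ß", "ss");

    static String strip(String s) {
        String out = Normalizer.normalize(s, Normalizer.Form.NFD)
            .replaceAll("\\p{M}", ""); // drop combining marks (accents)
        for (Map.Entry<String, String> e : LIGATURES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(strip("café"));  // cafe
        System.out.println(strip("œuvre")); // oeuvre
        System.out.println(strip("naïve")); // naive
    }
}
```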
## Distinct stem and variant semantics
@@ -206,7 +212,7 @@ The current dictionary format intentionally stays minimal:
- no explicit ambiguity syntax,
- no sectioning or nested structure.
Each dictionary item is simply one tab-separated word form after remark stripping and the configured case and diacritic normalization.
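The per-line rules above can be sketched in a few lines of plain Java: strip the trailing remark, split on tabs, normalize case, and skip items with embedded whitespace. This is illustrative only; the real parser lives in the library and also emits warnings for skipped items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch of the documented per-line parsing rules; not the library's parser. */
public final class DictionaryLineSketch {
    static List<String> parseLine(String line) {
        // Cut off a trailing remark introduced by '#' or "//".
        int cut = line.length();
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        if (hash >= 0) cut = Math.min(cut, hash);
        if (slashes >= 0) cut = Math.min(cut, slashes);
        String payload = line.substring(0, cut);

        List<String> items = new ArrayList<>();
        for (String column : payload.split("\t")) {
            String item = column.trim().toLowerCase(Locale.ROOT);
            if (item.isEmpty()) continue;
            // The real parser skips such items and logs a warning.
            if (item.chars().anyMatch(Character::isWhitespace)) continue;
            items.add(item);
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("Run\trunning\truns # irregular: ran")); // [run, running, runs]
    }
}
```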
## Authoring guidance
@@ -218,7 +224,7 @@ For reliable results, keep dictionaries:
- encoded in UTF-8,
- easy to audit in plain text form.
For most deployments, it is sensible to choose either preserved UTF-8 forms or a normalized ASCII/diacritic-stripped convention and keep that choice consistent across dictionary authoring, compilation, and runtime lookup.
## Relationship to other documentation

View File

@@ -32,7 +32,7 @@ The `storeOriginal` flag controls whether the canonical stem is inserted as a no
## Load a textual dictionary
Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The input may be plain UTF-8 text or GZip-compressed UTF-8 text; the loader detects GZip data from the stream header. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.
```java
import java.io.IOException;
@@ -59,6 +59,8 @@ public final class LoadTextDictionaryExample {
}
```
Additional `StemmerPatchTrieLoader.load(...)` overloads let callers provide explicit `WordTraversalDirection`, `CaseProcessingMode`, `DiacriticProcessingMode`, or a complete `TrieMetadata` instance. Use those overloads when a custom dictionary must be compiled with forward traversal for right-to-left languages, case-sensitive keys, or diacritic stripping.
## Load a compiled binary artifact
Binary loading is typically the preferred runtime path because it avoids reparsing the textual source and skips the preparation step entirely.
@@ -83,7 +85,7 @@ public final class LoadBinaryExample {
}
```
The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression. It includes persisted `TrieMetadata`, so lookup after loading uses the traversal, case-processing, diacritic-processing, and reduction settings captured when the trie was compiled.
## Build directly with a mutable builder
@@ -108,7 +110,7 @@ public final class BuilderExample {
final FrequencyTrie.Builder<String> builder =
new FrequencyTrie.Builder<>(String[]::new, settings);
final PatchCommandEncoder encoder = PatchCommandEncoder.builder().build();
builder.put("running", encoder.encode("running", "run"));
builder.put("runs", encoder.encode("runs", "run"));

View File

@@ -69,7 +69,7 @@ public final class LoadBinaryStemmerExample {
### Build or extend a stemmer from dictionary data
Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The input may be plain UTF-8 text or GZip-compressed UTF-8 text when loaded from a filesystem path. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace around columns, supports line remarks introduced by `#` or `//`, and skips dictionary items that contain embedded whitespace.
This path is also relevant when you extend an existing compiled stemmer with additional domain-specific entries and rebuild a new compact artifact.