docs: sync and improvements

2026-04-26 18:55:25 +02:00
parent 48f21cab72
commit 5a511374f3
13 changed files with 130 additions and 21 deletions
--- a/docs/cli-compilation.md
+++ b/docs/cli-compilation.md
@@ -8,7 +8,7 @@ This is the preferred preparation workflow when stemming should run against an a
 The `Compile` tool performs the following steps:
-1. reads the input dictionary in the standard Radixor stemmer format,
+1. reads the input dictionary in the standard Radixor stemmer format, accepting either plain UTF-8 text or GZip-compressed UTF-8 text,
 2. parses each line into a canonical stem column and its known variant columns,
 3. converts variants into patch commands,
 4. builds a mutable trie of patch-command values,
@@ -50,7 +50,7 @@ The CLI supports the following arguments:
 Path to the source dictionary file.
-The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The parser expects UTF-8 input, processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.
+The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The input may be plain UTF-8 text or GZip-compressed UTF-8 text; compression is detected from the stream header rather than the file extension. The parser processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.
 Example:
@@ -110,7 +110,7 @@ This option is intended for right-to-left languages where affix behavior should
 ### `--case-processing-mode <mode>`
-Controls dictionary key normalization during compilation and lookup.
+Controls dictionary key normalization during compilation and lookup. The setting is stored in persisted trie metadata and is therefore available to runtime lookup after binary loading.
 Supported values are:
@@ -205,7 +205,7 @@ The CLI is best used as a preparation step during packaging, deployment, or cont
 A `.radixor.gz` file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds.
-Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes traversal direction, RTL indicator, reduction mode, case-processing mode, and dominant thresholds. After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains.
+Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes format version, traversal direction, RTL indicator, reduction mode, dominant thresholds, diacritic-processing mode, and case-processing mode. After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains. The current CLI uses `DiacriticProcessingMode.AS_IS`; custom diacritic stripping is available through the programmatic builder and loader APIs rather than through a CLI flag.
 ### Choose reduction mode deliberately
--- a/docs/dictionary-format.md
+++ b/docs/dictionary-format.md
@@ -127,15 +127,21 @@ is processed the same way as:
 run	running	runs	ran
 ```
-## Character set and practical convention
+## Character set, compression, and normalization
-Dictionary files are read as UTF-8 text.
+Dictionary files are read as UTF-8 text. Files loaded through `StemmerPatchTrieLoader.load(Path, ...)` may be either plain UTF-8 text or GZip-compressed UTF-8 text; the loader detects GZip input from the stream header instead of relying on the file extension. Bundled dictionaries are stored as GZip resources and are decoded as UTF-8 after decompression.
-From the perspective of the parser and the stemming algorithm, the format is not restricted to plain ASCII tokens. The parser accepts ordinary Java `String` data, and the trie itself works with general character sequences rather than with an ASCII-only internal model. In principle, this means the system could process diacritic and non-diacritic forms alike, and it could also store forms with inconsistently used diacritics.
+The parser and trie are not restricted to ASCII. Dictionary items are ordinary Java `String` values, and trie traversal works over Java `char` sequences. This supports Latin-script data with diacritics, Cyrillic data, Hebrew, Persian, Yiddish, and other scripts represented in UTF-8, subject to the normal Java `String` model and the project’s traversal configuration.
-In practice, however, the format is currently best understood as **primarily intended for classical basic ASCII lexical input**, especially in the traditional stemming style where language data is normalized into plain characters in the ASCII range up to character code 127. This convention is particularly relevant for languages whose original orthography includes diacritics but whose stemming dictionaries are commonly maintained in normalized non-diacritic form.
+Case normalization is controlled by `CaseProcessingMode`. The default `LOWERCASE_WITH_LOCALE_ROOT` mode lowercases the line before columns are split into dictionary items. `AS_IS` preserves the original casing.
-Future versions may expand the documentation and operational guidance for dictionaries that intentionally preserve diacritics. At present, that workflow is not the primary documented use case, not because the algorithm fundamentally forbids it, but because a concrete project requirement for such support has not yet emerged.
+Diacritic normalization is controlled at trie-build and lookup time by `DiacriticProcessingMode`:
 - `AS_IS` preserves dictionary and lookup keys exactly after case handling,
 - `REMOVE` strips supported diacritics and common Latin ligatures on both insertion and lookup paths,
 - `AS_IS_AND_STRIPPED_FALLBACK` is declared in the public model but is not implemented yet and raises `UnsupportedOperationException`.
 For reliable production behavior, choose one normalization policy deliberately and apply it consistently. Normalized ASCII dictionaries remain a practical convention for some legacy stemming data, but they are not a format requirement.
 ## Distinct stem and variant semantics
@@ -206,7 +212,7 @@ The current dictionary format intentionally stays minimal:
 - no explicit ambiguity syntax,
 - no sectioning or nested structure.
-Each dictionary item is simply one tab-separated word form after remark stripping and lowercasing.
+Each dictionary item is simply one tab-separated word form after remark stripping and the configured case and diacritic normalization.
 ## Authoring guidance
@@ -218,7 +224,7 @@ For reliable results, keep dictionaries:
 - encoded in UTF-8,
 - easy to audit in plain text form.
-For most current deployments, it is sensible to keep dictionary content in normalized basic ASCII form unless there is a clear requirement to preserve diacritics end-to-end.
+For most deployments, it is sensible to choose either preserved UTF-8 forms or a normalized ASCII/diacritic-stripped convention and keep that choice consistent across dictionary authoring, compilation, and runtime lookup.
 ## Relationship to other documentation
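
The header-based GZip detection described in the documentation hunks above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Radixor loader itself: the class `GzipSniffer` and its methods are hypothetical names, and the only facts relied on are the standard GZip magic bytes (`0x1f 0x8b`) and the JDK `java.util.zip` classes.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: illustrates header sniffing, not the project's actual code.
final class GzipSniffer {

    private GzipSniffer() {
    }

    /** Wraps the stream in a GZIPInputStream only when it starts with 0x1f 0x8b. */
    static InputStream openPossiblyCompressed(final InputStream raw) throws IOException {
        final BufferedInputStream buffered = new BufferedInputStream(raw);
        buffered.mark(2);                  // remember the start of the stream
        final int first = buffered.read();
        final int second = buffered.read();
        buffered.reset();                  // rewind so the consumer sees every byte
        final boolean gzip = first == 0x1f && second == 0x8b;
        return gzip ? new GZIPInputStream(buffered) : buffered;
    }

    /** Reads the first line as UTF-8, regardless of whether the content is compressed. */
    static String readFirstLine(final byte[] content) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                openPossiblyCompressed(new ByteArrayInputStream(content)), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    /** Compresses a byte array with GZip; used here only to build a test input. */
    static byte[] gzip(final byte[] plain) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] plain = "run\trunning\truns\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFirstLine(plain));
        System.out.println(readFirstLine(gzip(plain)));
    }
}
```

Because the decision is made from the first two bytes, a compressed dictionary named `dictionary.txt` and a plain one named `dictionary.gz` would both be read correctly in this sketch, which matches the documented intent of not relying on the file extension.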
--- a/docs/programmatic-loading-and-building.md
+++ b/docs/programmatic-loading-and-building.md
@@ -32,7 +32,7 @@ The `storeOriginal` flag controls whether the canonical stem is inserted as a no
 ## Load a textual dictionary
-Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.
+Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The input may be plain UTF-8 text or GZip-compressed UTF-8 text; the loader detects GZip data from the stream header. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.
 ```java
 import java.io.IOException;
@@ -59,6 +59,8 @@ public final class LoadTextDictionaryExample {
 }
 ```
 Additional `StemmerPatchTrieLoader.load(...)` overloads let callers provide explicit `WordTraversalDirection`, `CaseProcessingMode`, `DiacriticProcessingMode`, or a complete `TrieMetadata` instance. Use those overloads when a custom dictionary must be compiled with forward traversal for right-to-left languages, case-sensitive keys, or diacritic stripping.
 ## Load a compiled binary artifact
 Binary loading is typically the preferred runtime path because it avoids reparsing the textual source and skips the preparation step entirely.
@@ -83,7 +85,7 @@ public final class LoadBinaryExample {
 }
 ```
-The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression.
+The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression. It includes persisted `TrieMetadata`, so lookup after loading uses the traversal, case-processing, diacritic-processing, and reduction settings captured when the trie was compiled.
 ## Build directly with a mutable builder
@@ -108,7 +110,7 @@ public final class BuilderExample {
        final FrequencyTrie.Builder<String> builder =
                new FrequencyTrie.Builder<>(String[]::new, settings);
-        final PatchCommandEncoder encoder = new PatchCommandEncoder();
+        final PatchCommandEncoder encoder = PatchCommandEncoder.builder().build();
        builder.put("running", encoder.encode("running", "run"));
        builder.put("runs", encoder.encode("runs", "run"));
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -69,7 +69,7 @@ public final class LoadBinaryStemmerExample {
 ### Build or extend a stemmer from dictionary data
-Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace, and supports line remarks introduced by `#` or `//`.
+Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The input may be plain UTF-8 text or GZip-compressed UTF-8 text when loaded from a filesystem path. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace around columns, supports line remarks introduced by `#` or `//`, and skips dictionary items that contain embedded whitespace.
 This path is also relevant when you extend an existing compiled stemmer with additional domain-specific entries and rebuild a new compact artifact.
--- a/src/main/java/org/egothor/stemmer/Compile.java
+++ b/src/main/java/org/egothor/stemmer/Compile.java
@@ -61,6 +61,7 @@ import java.util.logging.Logger;
 * --output &lt;file&gt;
 * --reduction-mode &lt;mode&gt;
 * [--store-original]
 * [--right-to-left]
 * [--case-processing-mode &lt;mode&gt;]
 * [--dominant-winner-min-percent &lt;1..100&gt;]
 * [--dominant-winner-over-second-ratio &lt;1..n&gt;]
--- a/src/main/java/org/egothor/stemmer/DiacriticStripper.java
+++ b/src/main/java/org/egothor/stemmer/DiacriticStripper.java
@@ -85,10 +85,25 @@ final class DiacriticStripper {
        registerSingle("Þ", 'T');
    }
    /**
     * Utility class.
     */
    private DiacriticStripper() {
        throw new AssertionError("No instances.");
    }
    /**
     * Removes supported diacritic marks and common Latin ligatures from the supplied
     * text.
     *
     * <p>
     * The method returns the original {@link String} instance when no replacement is
     * required, avoiding an unnecessary allocation on the common ASCII path.
     * </p>
     *
     * @param input text to normalize
     * @return normalized text, or {@code input} itself when it is already unchanged
     */
    /* default */ static String strip(final String input) {
        StringBuilder normalized = null;
@@ -116,6 +131,13 @@ final class DiacriticStripper {
        return normalized.toString();
    }
    /**
     * Returns the replacement text for one non-ASCII character.
     *
     * @param source source character
     * @return replacement text, or {@code null} when the character should be kept
     *         unchanged
     */
    @SuppressWarnings("PMD.AvoidLiteralsInIfCondition")
    private static String replacementFor(final char source) {
        if (source <= 0x007F) {
@@ -161,6 +183,12 @@ final class DiacriticStripper {
        return ascii.toString();
    }
    /**
     * Registers one-character replacements for a set of source characters.
     *
     * @param sourceCharacters characters to replace
     * @param replacement      replacement character
     */
    private static void registerSingle(final String sourceCharacters, final char replacement) {
        for (int index = 0; index < sourceCharacters.length(); index++) {
            DIRECT_REPLACEMENTS[sourceCharacters.charAt(index)] = replacement;
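
The allocation-avoiding contract documented for `strip` (return the original `String` instance when nothing changes) can be sketched as follows. This is an illustrative reconstruction, not the body of `DiacriticStripper`: the class `LazyStripSketch` and its four-entry replacement table are assumptions chosen for the example, and the real table is not shown in this diff.

```java
// Hypothetical sketch of lazy-allocation normalization; not the project's implementation.
final class LazyStripSketch {

    private LazyStripSketch() {
    }

    /** Returns replacement text, or null when the character is kept unchanged. */
    private static String replacementFor(final char source) {
        switch (source) {
            case '\u00E1': // a with acute
                return "a";
            case '\u00E9': // e with acute
                return "e";
            case '\u00DF': // sharp s (assumed mapping for illustration)
                return "ss";
            case '\u00C6': // AE ligature (assumed mapping for illustration)
                return "AE";
            default:
                return null;
        }
    }

    /** Strips the characters known to the table, allocating only on first replacement. */
    static String strip(final String input) {
        StringBuilder normalized = null;
        for (int index = 0; index < input.length(); index++) {
            final char current = input.charAt(index);
            final String replacement = replacementFor(current);
            if (replacement == null) {
                if (normalized != null) {
                    normalized.append(current);
                }
            } else {
                if (normalized == null) {
                    // First change: copy the untouched prefix once.
                    normalized = new StringBuilder(input.length() + 4);
                    normalized.append(input, 0, index);
                }
                normalized.append(replacement);
            }
        }
        return normalized == null ? input : normalized.toString();
    }

    public static void main(final String[] args) {
        System.out.println(strip("caf\u00E9"));
        System.out.println(strip("run") == "run");
    }
}
```

The builder is created only when the first replaceable character is found, so pure-ASCII keys pass through without any copy.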
--- a/src/main/java/org/egothor/stemmer/FrequencyTrie.java
+++ b/src/main/java/org/egothor/stemmer/FrequencyTrie.java
@@ -140,7 +140,7 @@ public final class FrequencyTrie<V> {
     *
     * @param arrayFactory array factory
     * @param root         compiled root node
-     * @param traversalDirection logical key traversal direction
+     * @param metadata     trie metadata describing lookup and persistence semantics
     * @throws NullPointerException if any argument is {@code null}
     */
    private FrequencyTrie(final IntFunction<V[]> arrayFactory, final CompiledNode<V> root,
@@ -922,6 +922,13 @@ public final class FrequencyTrie<V> {
            return this;
        }
        /**
         * Applies build-time dictionary-key normalization according to the builder
         * configuration.
         *
         * @param key dictionary key
         * @return normalized key for trie insertion
         */
        private String normalizeDictionaryKey(final String key) {
            String normalized = key;
--- a/src/main/java/org/egothor/stemmer/PatchCommandEncoder.java
+++ b/src/main/java/org/egothor/stemmer/PatchCommandEncoder.java
@@ -737,6 +737,7 @@ public final class PatchCommandEncoder {
     * @param targetCharacters target characters
     * @param sourceLength     source length
     * @param targetLength     target length
     * @param direction        traversal direction used to compare characters
     */
    private void fillMatrices(final char[] sourceCharacters, final char[] targetCharacters, final int sourceLength,
            final int targetLength, final WordTraversalDirection direction) {
@@ -988,6 +989,14 @@ public final class PatchCommandEncoder {
        private int replaceCost = 1;
        private int matchCost; // = 0
        /**
         * Creates a builder initialized with the default Egothor-compatible cost model
         * and backward traversal.
         */
        public Builder() {
            // Default values are assigned in field initializers.
        }
        /**
         * Sets traversal direction used by the created encoder.
         *
@@ -1011,7 +1020,7 @@ public final class PatchCommandEncoder {
        }
        /**
-         * Sets cost of an delete operation.
+         * Sets cost of a delete operation.
         * 
         * @param value cost of the operation
         * @return this builder
@@ -1022,7 +1031,7 @@ public final class PatchCommandEncoder {
        }
        /**
-         * Sets cost of an replace operation.
+         * Sets cost of a replace operation.
         * 
         * @param value cost of the operation
         * @return this builder
@@ -1033,7 +1042,7 @@ public final class PatchCommandEncoder {
        }
        /**
-         * Sets cost of an skip operation.
+         * Sets cost of a match operation.
         * 
         * @param value cost of the operation
         * @return this builder
--- a/src/main/java/org/egothor/stemmer/TrieMetadata.java
+++ b/src/main/java/org/egothor/stemmer/TrieMetadata.java
@@ -217,6 +217,14 @@ public record TrieMetadata(int formatVersion, WordTraversalDirection traversalDi
                diacriticProcessingMode, caseProcessingMode);
    }
    /**
     * Returns a required metadata entry from a parsed text block.
     *
     * @param entries parsed metadata entries
     * @param key     required entry key
     * @return non-blank entry value
     * @throws IllegalArgumentException if the entry is absent or blank
     */
    private static String requireEntry(final Map<String, String> entries, final String key) {
        final String value = entries.get(key);
        if (value == null || value.isBlank()) {
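
The `key=value` metadata block and the `requireEntry` contract documented above can be illustrated with a small standalone sketch. The class `MetadataBlockSketch`, the parsing rules, and the key names used in `main` are assumptions made for this example; the actual keys and parser in `TrieMetadata` are not visible in this diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of reading a key=value metadata block; not the project's parser.
final class MetadataBlockSketch {

    private MetadataBlockSketch() {
    }

    /** Parses non-empty lines of the form key=value into an insertion-ordered map. */
    static Map<String, String> parse(final String block) {
        final Map<String, String> entries = new LinkedHashMap<>();
        for (final String line : block.split("\\R")) {
            final String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            final int separator = trimmed.indexOf('=');
            if (separator <= 0) {
                throw new IllegalArgumentException("Malformed metadata line: " + trimmed);
            }
            entries.put(trimmed.substring(0, separator).trim(),
                    trimmed.substring(separator + 1).trim());
        }
        return entries;
    }

    /** Returns a non-blank value or fails, mirroring the documented contract. */
    static String requireEntry(final Map<String, String> entries, final String key) {
        final String value = entries.get(key);
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("Missing metadata entry: " + key);
        }
        return value;
    }

    public static void main(final String[] args) {
        // Key names below are illustrative assumptions only.
        final Map<String, String> entries = parse("formatVersion=1\ncaseProcessingMode=AS_IS\n");
        System.out.println(requireEntry(entries, "caseProcessingMode"));
    }
}
```

Failing fast on an absent or blank entry keeps a truncated or hand-edited metadata block from silently falling back to defaults.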
--- a/src/main/java/org/egothor/stemmer/trie/ChildDescriptor.java
+++ b/src/main/java/org/egothor/stemmer/trie/ChildDescriptor.java
@@ -60,11 +60,23 @@ import java.util.Objects;
        this.childSignature = childSignature;
    }
    /**
     * Returns a hash code consistent with descriptor equality.
     *
     * @return descriptor hash code
     */
    @Override
    public int hashCode() {
        return Objects.hash(this.edge, this.childSignature);
    }
    /**
     * Compares this descriptor with another object.
     *
     * @param other object to compare with
     * @return {@code true} when both descriptors represent the same semantic
     *         reduction identity
     */
    @Override
    public boolean equals(final Object other) {
        if (this == other) {
--- a/src/main/java/org/egothor/stemmer/trie/DominantLocalDescriptor.java
+++ b/src/main/java/org/egothor/stemmer/trie/DominantLocalDescriptor.java
@@ -53,11 +53,23 @@ import java.util.Objects;
        this.dominantValue = dominantValue;
    }
    /**
     * Returns a hash code consistent with descriptor equality.
     *
     * @return descriptor hash code
     */
    @Override
    public int hashCode() {
        return Objects.hashCode(this.dominantValue);
    }
    /**
     * Compares this descriptor with another object.
     *
     * @param other object to compare with
     * @return {@code true} when both descriptors represent the same semantic
     *         reduction identity
     */
    @Override
    public boolean equals(final Object other) {
        if (this == other) {
--- a/src/main/java/org/egothor/stemmer/trie/RankedLocalDescriptor.java
+++ b/src/main/java/org/egothor/stemmer/trie/RankedLocalDescriptor.java
@@ -65,11 +65,23 @@ import java.util.List;
                Collections.unmodifiableList(Arrays.asList(Arrays.copyOf(orderedValues, orderedValues.length))));
    }
    /**
     * Returns a hash code consistent with descriptor equality.
     *
     * @return descriptor hash code
     */
    @Override
    public int hashCode() {
        return this.orderedValues.hashCode();
    }
    /**
     * Compares this descriptor with another object.
     *
     * @param other object to compare with
     * @return {@code true} when both descriptors represent the same semantic
     *         reduction identity
     */
    @Override
    public boolean equals(final Object other) {
        if (this == other) {
--- a/src/main/java/org/egothor/stemmer/trie/UnorderedLocalDescriptor.java
+++ b/src/main/java/org/egothor/stemmer/trie/UnorderedLocalDescriptor.java
@@ -67,11 +67,23 @@ import java.util.Set;
        return new UnorderedLocalDescriptor(Collections.unmodifiableSet(distinct));
    }
    /**
     * Returns a hash code consistent with descriptor equality.
     *
     * @return descriptor hash code
     */
    @Override
    public int hashCode() {
        return this.distinctValues.hashCode();
    }
    /**
     * Compares this descriptor with another object.
     *
     * @param other object to compare with
     * @return {@code true} when both descriptors represent the same semantic
     *         reduction identity
     */
    @Override
    public boolean equals(final Object other) {
        if (this == other) {