From 5a511374f374f92eb9f5aba31a54c743991b4584 Mon Sep 17 00:00:00 2001 From: Leo Galambos Date: Sun, 26 Apr 2026 18:55:25 +0200 Subject: [PATCH] docs: sync and improvements --- docs/cli-compilation.md | 8 +++--- docs/dictionary-format.md | 20 ++++++++----- docs/programmatic-loading-and-building.md | 8 ++++-- docs/quick-start.md | 2 +- .../java/org/egothor/stemmer/Compile.java | 1 + .../egothor/stemmer/DiacriticStripper.java | 28 +++++++++++++++++++ .../org/egothor/stemmer/FrequencyTrie.java | 13 +++++++-- .../egothor/stemmer/PatchCommandEncoder.java | 15 ++++++++-- .../org/egothor/stemmer/TrieMetadata.java | 8 ++++++ .../egothor/stemmer/trie/ChildDescriptor.java | 12 ++++++++ .../stemmer/trie/DominantLocalDescriptor.java | 12 ++++++++ .../stemmer/trie/RankedLocalDescriptor.java | 12 ++++++++ .../trie/UnorderedLocalDescriptor.java | 12 ++++++++ 13 files changed, 130 insertions(+), 21 deletions(-) diff --git a/docs/cli-compilation.md b/docs/cli-compilation.md index ff26869..f650f2e 100644 --- a/docs/cli-compilation.md +++ b/docs/cli-compilation.md @@ -8,7 +8,7 @@ This is the preferred preparation workflow when stemming should run against an a The `Compile` tool performs the following steps: -1. reads the input dictionary in the standard Radixor stemmer format, +1. reads the input dictionary in the standard Radixor stemmer format, accepting either plain UTF-8 text or GZip-compressed UTF-8 text, 2. parses each line into a canonical stem column and its known variant columns, 3. converts variants into patch commands, 4. builds a mutable trie of patch-command values, @@ -50,7 +50,7 @@ The CLI supports the following arguments: Path to the source dictionary file. -The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. 
The parser expects UTF-8 input, processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries. +The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The input may be plain UTF-8 text or GZip-compressed UTF-8 text; compression is detected from the stream header rather than the file extension. The parser processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries. Example: @@ -110,7 +110,7 @@ This option is intended for right-to-left languages where affix behavior should ### `--case-processing-mode ` -Controls dictionary key normalization during compilation and lookup. +Controls dictionary key normalization during compilation and lookup. The setting is stored in persisted trie metadata and is therefore available to runtime lookup after binary loading. Supported values are: @@ -205,7 +205,7 @@ The CLI is best used as a preparation step during packaging, deployment, or cont A `.radixor.gz` file should be handled as a versioned output artifact. It represents a specific dictionary state, a specific reduction mode, and, where relevant, specific dominant-result thresholds. -Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes traversal direction, RTL indicator, reduction mode, case-processing mode, and dominant thresholds. 
After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains. +Compiled tries also persist a human-readable metadata block (`key=value` lines) that includes format version, traversal direction, RTL indicator, reduction mode, dominant thresholds, diacritic-processing mode, and case-processing mode. After decompression, you can inspect this block directly to identify what dictionary/trie configuration the artifact contains. The current CLI uses `DiacriticProcessingMode.AS_IS`; custom diacritic stripping is available through the programmatic builder and loader APIs rather than through a CLI flag. ### Choose reduction mode deliberately diff --git a/docs/dictionary-format.md b/docs/dictionary-format.md index 7082c13..eb0aa11 100644 --- a/docs/dictionary-format.md +++ b/docs/dictionary-format.md @@ -127,15 +127,21 @@ is processed the same way as: run running runs ran ``` -## Character set and practical convention +## Character set, compression, and normalization -Dictionary files are read as UTF-8 text. +Dictionary files are read as UTF-8 text. Files loaded through `StemmerPatchTrieLoader.load(Path, ...)` may be either plain UTF-8 text or GZip-compressed UTF-8 text; the loader detects GZip input from the stream header instead of relying on the file extension. Bundled dictionaries are stored as GZip resources and are decoded as UTF-8 after decompression. -From the perspective of the parser and the stemming algorithm, the format is not restricted to plain ASCII tokens. The parser accepts ordinary Java `String` data, and the trie itself works with general character sequences rather than with an ASCII-only internal model. In principle, this means the system could process diacritic and non-diacritic forms alike, and it could also store forms with inconsistently used diacritics. +The parser and trie are not restricted to ASCII. 
Dictionary items are ordinary Java `String` values, and trie traversal works over Java `char` sequences. This supports Latin-script data with diacritics, Cyrillic data, Hebrew, Persian, Yiddish, and other scripts represented in UTF-8, subject to the normal Java `String` model and the project’s traversal configuration. -In practice, however, the format is currently best understood as **primarily intended for classical basic ASCII lexical input**, especially in the traditional stemming style where language data is normalized into plain characters in the ASCII range up to character code 127. This convention is particularly relevant for languages whose original orthography includes diacritics but whose stemming dictionaries are commonly maintained in normalized non-diacritic form. +Case normalization is controlled by `CaseProcessingMode`. The default `LOWERCASE_WITH_LOCALE_ROOT` mode lowercases the line before columns are split into dictionary items. `AS_IS` preserves the original casing. -Future versions may expand the documentation and operational guidance for dictionaries that intentionally preserve diacritics. At present, that workflow is not the primary documented use case, not because the algorithm fundamentally forbids it, but because a concrete project requirement for such support has not yet emerged. +Diacritic normalization is controlled at trie-build and lookup time by `DiacriticProcessingMode`: + +- `AS_IS` preserves dictionary and lookup keys exactly after case handling, +- `REMOVE` strips supported diacritics and common Latin ligatures on both insertion and lookup paths, +- `AS_IS_AND_STRIPPED_FALLBACK` is declared in the public model but is not implemented yet and raises `UnsupportedOperationException`. + +For reliable production behavior, choose one normalization policy deliberately and apply it consistently. Normalized ASCII dictionaries remain a practical convention for some legacy stemming data, but they are not a format requirement. 
## Distinct stem and variant semantics @@ -206,7 +212,7 @@ The current dictionary format intentionally stays minimal: - no explicit ambiguity syntax, - no sectioning or nested structure. -Each dictionary item is simply one tab-separated word form after remark stripping and lowercasing. +Each dictionary item is simply one tab-separated word form after remark stripping and the configured case and diacritic normalization. ## Authoring guidance @@ -218,7 +224,7 @@ For reliable results, keep dictionaries: - encoded in UTF-8, - easy to audit in plain text form. -For most current deployments, it is sensible to keep dictionary content in normalized basic ASCII form unless there is a clear requirement to preserve diacritics end-to-end. +For most deployments, it is sensible to choose either preserved UTF-8 forms or a normalized ASCII/diacritic-stripped convention and keep that choice consistent across dictionary authoring, compilation, and runtime lookup. ## Relationship to other documentation diff --git a/docs/programmatic-loading-and-building.md b/docs/programmatic-loading-and-building.md index 1cf6c5d..45c93d3 100644 --- a/docs/programmatic-loading-and-building.md +++ b/docs/programmatic-loading-and-building.md @@ -32,7 +32,7 @@ The `storeOriginal` flag controls whether the canonical stem is inserted as a no ## Load a textual dictionary -Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics. 
+Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The input may be plain UTF-8 text or GZip-compressed UTF-8 text; the loader detects GZip data from the stream header. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics. ```java import java.io.IOException; @@ -59,6 +59,8 @@ public final class LoadTextDictionaryExample { } ``` +Additional `StemmerPatchTrieLoader.load(...)` overloads let callers provide explicit `WordTraversalDirection`, `CaseProcessingMode`, `DiacriticProcessingMode`, or a complete `TrieMetadata` instance. Use those overloads when a custom dictionary must be compiled with forward traversal for right-to-left languages, case-sensitive keys, or diacritic stripping. + ## Load a compiled binary artifact Binary loading is typically the preferred runtime path because it avoids reparsing the textual source and skips the preparation step entirely. @@ -83,7 +85,7 @@ public final class LoadBinaryExample { } ``` -The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression. +The binary format is the native `FrequencyTrie` serialization wrapped in GZip compression. It includes persisted `TrieMetadata`, so lookup after loading uses the traversal, case-processing, diacritic-processing, and reduction settings captured when the trie was compiled. 
## Build directly with a mutable builder @@ -108,7 +110,7 @@ public final class BuilderExample { final FrequencyTrie.Builder builder = new FrequencyTrie.Builder<>(String[]::new, settings); - final PatchCommandEncoder encoder = new PatchCommandEncoder(); + final PatchCommandEncoder encoder = PatchCommandEncoder.builder().build(); builder.put("running", encoder.encode("running", "run")); builder.put("runs", encoder.encode("runs", "run")); diff --git a/docs/quick-start.md b/docs/quick-start.md index a8fac02..06fb887 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -69,7 +69,7 @@ public final class LoadBinaryStemmerExample { ### Build or extend a stemmer from dictionary data -Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace, and supports line remarks introduced by `#` or `//`. +Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The input may be plain UTF-8 text or GZip-compressed UTF-8 text when loaded from a filesystem path. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace around columns, supports line remarks introduced by `#` or `//`, and skips dictionary items that contain embedded whitespace. This path is also relevant when you extend an existing compiled stemmer with additional domain-specific entries and rebuild a new compact artifact. 
diff --git a/src/main/java/org/egothor/stemmer/Compile.java b/src/main/java/org/egothor/stemmer/Compile.java index 7981a5a..9a8c9fe 100644 --- a/src/main/java/org/egothor/stemmer/Compile.java +++ b/src/main/java/org/egothor/stemmer/Compile.java @@ -61,6 +61,7 @@ import java.util.logging.Logger; * --output <file> * --reduction-mode <mode> * [--store-original] + * [--right-to-left] * [--case-processing-mode <mode>] * [--dominant-winner-min-percent <1..100>] * [--dominant-winner-over-second-ratio <1..n>] diff --git a/src/main/java/org/egothor/stemmer/DiacriticStripper.java b/src/main/java/org/egothor/stemmer/DiacriticStripper.java index 47aaa04..35fe0d9 100644 --- a/src/main/java/org/egothor/stemmer/DiacriticStripper.java +++ b/src/main/java/org/egothor/stemmer/DiacriticStripper.java @@ -85,10 +85,25 @@ final class DiacriticStripper { registerSingle("Þ", 'T'); } + /** + * Utility class. + */ private DiacriticStripper() { throw new AssertionError("No instances."); } + /** + * Removes supported diacritic marks and common Latin ligatures from the supplied + * text. + * + *
<p>
+ * The method returns the original {@link String} instance when no replacement is + * required, avoiding an unnecessary allocation on the common ASCII path. + *
</p>
+ * + * @param input text to normalize + * @return normalized text, or {@code input} itself when it is already unchanged + */ /* default */ static String strip(final String input) { StringBuilder normalized = null; @@ -116,6 +131,13 @@ final class DiacriticStripper { return normalized.toString(); } + /** + * Returns the replacement text for one non-ASCII character. + * + * @param source source character + * @return replacement text, or {@code null} when the character should be kept + * unchanged + */ @SuppressWarnings("PMD.AvoidLiteralsInIfCondition") private static String replacementFor(final char source) { if (source <= 0x007F) { @@ -161,6 +183,12 @@ final class DiacriticStripper { return ascii.toString(); } + /** + * Registers one-character replacements for a set of source characters. + * + * @param sourceCharacters characters to replace + * @param replacement replacement character + */ private static void registerSingle(final String sourceCharacters, final char replacement) { for (int index = 0; index < sourceCharacters.length(); index++) { DIRECT_REPLACEMENTS[sourceCharacters.charAt(index)] = replacement; diff --git a/src/main/java/org/egothor/stemmer/FrequencyTrie.java b/src/main/java/org/egothor/stemmer/FrequencyTrie.java index 921bcaf..53aa3cf 100644 --- a/src/main/java/org/egothor/stemmer/FrequencyTrie.java +++ b/src/main/java/org/egothor/stemmer/FrequencyTrie.java @@ -138,9 +138,9 @@ public final class FrequencyTrie { /** * Creates a new compiled trie instance. 
* - * @param arrayFactory array factory - * @param root compiled root node - * @param traversalDirection logical key traversal direction + * @param arrayFactory array factory + * @param root compiled root node + * @param metadata trie metadata describing lookup and persistence semantics * @throws NullPointerException if any argument is {@code null} */ private FrequencyTrie(final IntFunction arrayFactory, final CompiledNode root, @@ -922,6 +922,13 @@ public final class FrequencyTrie { return this; } + /** + * Applies build-time dictionary-key normalization according to the builder + * configuration. + * + * @param key dictionary key + * @return normalized key for trie insertion + */ private String normalizeDictionaryKey(final String key) { String normalized = key; diff --git a/src/main/java/org/egothor/stemmer/PatchCommandEncoder.java b/src/main/java/org/egothor/stemmer/PatchCommandEncoder.java index 1c9d8e1..117f8ea 100644 --- a/src/main/java/org/egothor/stemmer/PatchCommandEncoder.java +++ b/src/main/java/org/egothor/stemmer/PatchCommandEncoder.java @@ -737,6 +737,7 @@ public final class PatchCommandEncoder { * @param targetCharacters target characters * @param sourceLength source length * @param targetLength target length + * @param direction traversal direction used to compare characters */ private void fillMatrices(final char[] sourceCharacters, final char[] targetCharacters, final int sourceLength, final int targetLength, final WordTraversalDirection direction) { @@ -988,6 +989,14 @@ public final class PatchCommandEncoder { private int replaceCost = 1; private int matchCost; // = 0 + /** + * Creates a builder initialized with the default Egothor-compatible cost model + * and backward traversal. + */ + public Builder() { + // Default values are assigned in field initializers. + } + /** * Sets traversal direction used by the created encoder. * @@ -1011,7 +1020,7 @@ public final class PatchCommandEncoder { } /** - * Sets cost of an delete operation. 
+ * Sets cost of a delete operation. * * @param value cost of the operation * @return this builder @@ -1022,7 +1031,7 @@ public final class PatchCommandEncoder { } /** - * Sets cost of an replace operation. + * Sets cost of a replace operation. * * @param value cost of the operation * @return this builder @@ -1033,7 +1042,7 @@ public final class PatchCommandEncoder { } /** - * Sets cost of an skip operation. + * Sets cost of a match operation. * * @param value cost of the operation * @return this builder diff --git a/src/main/java/org/egothor/stemmer/TrieMetadata.java b/src/main/java/org/egothor/stemmer/TrieMetadata.java index 9e85295..02c2759 100644 --- a/src/main/java/org/egothor/stemmer/TrieMetadata.java +++ b/src/main/java/org/egothor/stemmer/TrieMetadata.java @@ -217,6 +217,14 @@ public record TrieMetadata(int formatVersion, WordTraversalDirection traversalDi diacriticProcessingMode, caseProcessingMode); } + /** + * Returns a required metadata entry from a parsed text block. + * + * @param entries parsed metadata entries + * @param key required entry key + * @return non-blank entry value + * @throws IllegalArgumentException if the entry is absent or blank + */ private static String requireEntry(final Map entries, final String key) { final String value = entries.get(key); if (value == null || value.isBlank()) { diff --git a/src/main/java/org/egothor/stemmer/trie/ChildDescriptor.java b/src/main/java/org/egothor/stemmer/trie/ChildDescriptor.java index 0b65b9e..70c65e0 100644 --- a/src/main/java/org/egothor/stemmer/trie/ChildDescriptor.java +++ b/src/main/java/org/egothor/stemmer/trie/ChildDescriptor.java @@ -60,11 +60,23 @@ import java.util.Objects; this.childSignature = childSignature; } + /** + * Returns a hash code consistent with descriptor equality. + * + * @return descriptor hash code + */ @Override public int hashCode() { return Objects.hash(this.edge, this.childSignature); } + /** + * Compares this descriptor with another object. 
+ * + * @param other object to compare with + * @return {@code true} when both descriptors represent the same semantic + * reduction identity + */ @Override public boolean equals(final Object other) { if (this == other) { diff --git a/src/main/java/org/egothor/stemmer/trie/DominantLocalDescriptor.java b/src/main/java/org/egothor/stemmer/trie/DominantLocalDescriptor.java index 5be2a9a..7eed0d6 100644 --- a/src/main/java/org/egothor/stemmer/trie/DominantLocalDescriptor.java +++ b/src/main/java/org/egothor/stemmer/trie/DominantLocalDescriptor.java @@ -53,11 +53,23 @@ import java.util.Objects; this.dominantValue = dominantValue; } + /** + * Returns a hash code consistent with descriptor equality. + * + * @return descriptor hash code + */ @Override public int hashCode() { return Objects.hashCode(this.dominantValue); } + /** + * Compares this descriptor with another object. + * + * @param other object to compare with + * @return {@code true} when both descriptors represent the same semantic + * reduction identity + */ @Override public boolean equals(final Object other) { if (this == other) { diff --git a/src/main/java/org/egothor/stemmer/trie/RankedLocalDescriptor.java b/src/main/java/org/egothor/stemmer/trie/RankedLocalDescriptor.java index 5d7a331..af6efec 100644 --- a/src/main/java/org/egothor/stemmer/trie/RankedLocalDescriptor.java +++ b/src/main/java/org/egothor/stemmer/trie/RankedLocalDescriptor.java @@ -65,11 +65,23 @@ import java.util.List; Collections.unmodifiableList(Arrays.asList(Arrays.copyOf(orderedValues, orderedValues.length)))); } + /** + * Returns a hash code consistent with descriptor equality. + * + * @return descriptor hash code + */ @Override public int hashCode() { return this.orderedValues.hashCode(); } + /** + * Compares this descriptor with another object. 
+ * + * @param other object to compare with + * @return {@code true} when both descriptors represent the same semantic + * reduction identity + */ @Override public boolean equals(final Object other) { if (this == other) { diff --git a/src/main/java/org/egothor/stemmer/trie/UnorderedLocalDescriptor.java b/src/main/java/org/egothor/stemmer/trie/UnorderedLocalDescriptor.java index 5124b38..f29e0c2 100644 --- a/src/main/java/org/egothor/stemmer/trie/UnorderedLocalDescriptor.java +++ b/src/main/java/org/egothor/stemmer/trie/UnorderedLocalDescriptor.java @@ -67,11 +67,23 @@ import java.util.Set; return new UnorderedLocalDescriptor(Collections.unmodifiableSet(distinct)); } + /** + * Returns a hash code consistent with descriptor equality. + * + * @return descriptor hash code + */ @Override public int hashCode() { return this.distinctValues.hashCode(); } + /** + * Compares this descriptor with another object. + * + * @param other object to compare with + * @return {@code true} when both descriptors represent the same semantic + * reduction identity + */ @Override public boolean equals(final Object other) { if (this == other) {