feat: Apply metadata-driven case normalization in get/getAll

2026-04-23 22:32:05 +02:00
parent 4d939f5b6e
commit 8785f2b7cb
14 changed files with 353 additions and 43 deletions
--- a/docs/cli-compilation.md
+++ b/docs/cli-compilation.md
@@ -47,7 +47,7 @@ The CLI supports the following arguments:

 Path to the source dictionary file.

-The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The parser expects UTF-8 input, lowercases it using `Locale.ROOT`, ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.
+The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The parser expects UTF-8 input, processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.

 Example:

--- a/docs/contributing-dictionaries.md
+++ b/docs/contributing-dictionaries.md
@@ -41,7 +41,7 @@ The parser:

 - reads UTF-8 text,
 - interprets each line as tab-separated values,
-- normalizes input to lower case using `Locale.ROOT`,
+- applies configurable case processing through `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`),
 - ignores empty lines,
 - supports remarks introduced by `#` or `//`,
 - currently ignores dictionary items containing embedded whitespace and reports them through warning-level log entries.
--- a/docs/dictionary-format.md
+++ b/docs/dictionary-format.md
@@ -111,7 +111,7 @@ This is also valid:

 ## Case normalization

-Input lines are normalized to lower case using `Locale.ROOT` before tab-separated columns are processed into dictionary entries.
+Input-line case normalization is controlled by `CaseProcessingMode`; by default, the parser applies `LOWERCASE_WITH_LOCALE_ROOT` before tab-separated columns are processed into dictionary entries.

 That means dictionary authors should treat the format as **case-insensitive at load time**. If a file contains uppercase or mixed-case tokens, they will be normalized during parsing.

@@ -193,7 +193,7 @@ Run	Running	Runs	Ran
 CONNECT	Connected	Connecting
 ```

-This is accepted, but it is normalized to lower case during parsing.
+This is accepted. Under the default `LOWERCASE_WITH_LOCALE_ROOT` mode it is normalized to lower case during parsing; under `AS_IS` it is preserved.

 ## Format limitations

--- a/docs/programmatic-loading-and-building.md
+++ b/docs/programmatic-loading-and-building.md
@@ -32,7 +32,7 @@ The `storeOriginal` flag controls whether the canonical stem is inserted as a no

 ## Load a textual dictionary

-Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input is normalized to lower case using `Locale.ROOT`, trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.
+Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.

 ```java
 import java.io.IOException;
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -69,7 +69,7 @@ public final class LoadBinaryStemmerExample {

 ### Build or extend a stemmer from dictionary data

-Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The parser lowercases input with `Locale.ROOT`, ignores leading and trailing whitespace, and supports line remarks introduced by `#` or `//`.
+Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace, and supports line remarks introduced by `#` or `//`.

 This path is also relevant when you extend an existing compiled stemmer with additional domain-specific entries and rebuild a new compact artifact.

@@ -206,4 +206,4 @@ Dictionary compilation is usually a one-time preparation step and is generally f

 ## Persisted trie metadata

-Every compiled trie artifact stores a `TrieMetadata` descriptor together with the immutable trie payload. That metadata currently records the binary format version, the `WordTraversalDirection`, the `ReductionSettings` used during compilation, and the declared `DiacriticProcessingMode`. Even when a given release does not yet actively branch on every field at query time, persisting the full descriptor keeps artifacts self-describing and prepares the format for future matching strategies without relying on side-channel configuration.
+Every compiled trie artifact stores a `TrieMetadata` descriptor together with the immutable trie payload. That metadata currently records the binary format version, the `WordTraversalDirection`, the `ReductionSettings` used during compilation, the declared `DiacriticProcessingMode`, and the selected `CaseProcessingMode`. The traversal and case-processing settings are applied during runtime lookup (`get`, `getAll`). Persisting the full descriptor also keeps artifacts self-describing and prepares the format for future matching strategies without relying on side-channel configuration.
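
Reviewer note: the two modes named in these doc changes can be illustrated with a minimal, standalone sketch. It is not the project's implementation. The `Mode` enum and `normalize` helper below are hypothetical stand-ins; only the constant names `LOWERCASE_WITH_LOCALE_ROOT` and `AS_IS` come from the documentation in this commit. The sketch assumes the default mode amounts to `String.toLowerCase(Locale.ROOT)`, as the previous wording of the docs stated.

```java
import java.util.Locale;

public final class CaseProcessingSketch {

    /**
     * Hypothetical stand-in for the library's CaseProcessingMode.
     * Only the two constant names are taken from the documentation.
     */
    public enum Mode { LOWERCASE_WITH_LOCALE_ROOT, AS_IS }

    /** Normalizes a single token according to the selected mode. */
    public static String normalize(String token, Mode mode) {
        switch (mode) {
            case LOWERCASE_WITH_LOCALE_ROOT:
                // Locale.ROOT keeps the result independent of the JVM default locale.
                return token.toLowerCase(Locale.ROOT);
            case AS_IS:
                // The token is passed through unchanged.
                return token;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("CONNECT", Mode.LOWERCASE_WITH_LOCALE_ROOT)); // prints "connect"
        System.out.println(normalize("CONNECT", Mode.AS_IS));                      // prints "CONNECT"
    }
}
```

If lookup applies the mode stored in the artifact metadata, as the updated text describes, then under the default mode a query such as `CONNECT` would be lowercased before traversal, while under `AS_IS` it would be looked up exactly as typed.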