feat: Apply metadata-driven case normalization in get/getAll
This commit is contained in:
@@ -47,7 +47,7 @@ The CLI supports the following arguments:
|
||||
|
||||
Path to the source dictionary file.
|
||||
|
||||
The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The parser expects UTF-8 input, lowercases it using `Locale.ROOT`, ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.
|
||||
The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The parser expects UTF-8 input, processes case according to `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.
|
||||
|
||||
Example:
|
||||
|
||||
|
||||
@@ -41,7 +41,7 @@ The parser:
|
||||
|
||||
- reads UTF-8 text,
|
||||
- interprets each line as tab-separated values,
|
||||
- normalizes input to lower case using `Locale.ROOT`,
|
||||
- applies configurable case processing through `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`),
|
||||
- ignores empty lines,
|
||||
- supports remarks introduced by `#` or `//`,
|
||||
- currently ignores dictionary items containing embedded whitespace and reports them through warning-level log entries.
|
||||
|
||||
@@ -111,7 +111,7 @@ This is also valid:
|
||||
|
||||
## Case normalization
|
||||
|
||||
Input lines are normalized to lower case using `Locale.ROOT` before tab-separated columns are processed into dictionary entries.
|
||||
Input-line case normalization is controlled by `CaseProcessingMode`; by default the parser uses `LOWERCASE_WITH_LOCALE_ROOT` before tab-separated columns are processed into dictionary entries.
|
||||
|
||||
That means dictionary authors should treat the format as **case-insensitive at load time**. If a file contains uppercase or mixed-case tokens, they will be normalized during parsing.
|
||||
|
||||
@@ -193,7 +193,7 @@ Run Running Runs Ran
|
||||
CONNECT Connected Connecting
|
||||
```
|
||||
|
||||
This is accepted, but it is normalized to lower case during parsing.
|
||||
This is accepted. Under the default `LOWERCASE_WITH_LOCALE_ROOT` mode it is normalized to lower case during parsing; under `AS_IS` it is preserved.
|
||||
|
||||
## Format limitations
|
||||
|
||||
|
||||
@@ -32,7 +32,7 @@ The `storeOriginal` flag controls whether the canonical stem is inserted as a no
|
||||
|
||||
## Load a textual dictionary
|
||||
|
||||
Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input is normalized to lower case using `Locale.ROOT`, trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.
|
||||
Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input case normalization is controlled by `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.
|
||||
|
||||
```java
|
||||
import java.io.IOException;
|
||||
|
||||
@@ -69,7 +69,7 @@ public final class LoadBinaryStemmerExample {
|
||||
|
||||
### Build or extend a stemmer from dictionary data
|
||||
|
||||
Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The parser lowercases input with `Locale.ROOT`, ignores leading and trailing whitespace, and supports line remarks introduced by `#` or `//`.
|
||||
Radixor can also build a compiled trie from a custom dictionary. Dictionary lines consist of a canonical stem followed by zero or more variants. The parser applies `CaseProcessingMode` (default: `LOWERCASE_WITH_LOCALE_ROOT`), ignores leading and trailing whitespace, and supports line remarks introduced by `#` or `//`.
|
||||
|
||||
This path is also relevant when you extend an existing compiled stemmer with additional domain-specific entries and rebuild a new compact artifact.
|
||||
|
||||
@@ -206,4 +206,4 @@ Dictionary compilation is usually a one-time preparation step and is generally f
|
||||
|
||||
## Persisted trie metadata
|
||||
|
||||
Every compiled trie artifact stores a `TrieMetadata` descriptor together with the immutable trie payload. That metadata currently records the binary format version, the `WordTraversalDirection`, the `ReductionSettings` used during compilation, and the declared `DiacriticProcessingMode`. Even when a given release does not yet actively branch on every field at query time, persisting the full descriptor keeps artifacts self-describing and prepares the format for future matching strategies without relying on side-channel configuration.
|
||||
Every compiled trie artifact stores a `TrieMetadata` descriptor together with the immutable trie payload. That metadata currently records the binary format version, the `WordTraversalDirection`, the `ReductionSettings` used during compilation, the declared `DiacriticProcessingMode`, and the selected `CaseProcessingMode`. The traversal and case-processing settings are applied during runtime lookup (`get`, `getAll`), while persisting the full descriptor keeps artifacts self-describing and prepares the format for future matching strategies without relying on side-channel configuration.
|
||||
|
||||
Reference in New Issue
Block a user