feat: Prepare TrieMetadata and new stemmer data integration
@@ -25,11 +25,11 @@ Each stage has a different purpose.

 The textual dictionary groups known word forms under a canonical stem:

 ```text
-run running runs ran
-connect connected connecting connection
+run running runs ran
+connect connected connecting connection
 ```

-The first token is the canonical stem. The following tokens are known variants.
+The first column is the canonical stem. The following tab-separated columns are known variants.

 ### Patch-command generation

@@ -1,41 +1,49 @@
 # Built-in Languages

-Radixor provides a set of bundled stemmer dictionaries that can be loaded directly without preparing custom lexical data first.
-
-These resources are intended as practical default dictionaries for common use. They provide a solid starting point for evaluation, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced by a custom dictionary.
+Radixor ships with a curated set of bundled stemmer dictionaries that can be loaded directly from the library distribution. These resources are intended to provide an immediately usable baseline for evaluation, prototyping, integration, and general-purpose stemming workloads, while still fitting naturally into workflows where the bundled baseline is later refined, extended, or replaced with custom lexical data.

 ## Overview

 Bundled dictionaries are exposed through:

 ```java
-StemmerPatchTrieLoader.Language
+org.egothor.stemmer.StemmerPatchTrieLoader.Language
 ```

-They are packaged with the library as text resources and compiled into a `FrequencyTrie<String>` when loaded.
+Each bundled dictionary is packaged with the library as a compressed UTF-8 text resource. When loaded, the resource is parsed by `StemmerDictionaryParser`, transformed into patch-command mappings, and compiled into a read-only `FrequencyTrie<String>` by `StemmerPatchTrieLoader`.

-## Supported languages
+The bundled language definition also carries a language-level right-to-left flag. That flag is used by the loader to derive the `WordTraversalDirection` used for both trie-key construction and patch-command generation. In practice, left-to-right bundled languages use historical backward Egothor traversal, while right-to-left bundled languages use forward traversal over the stored form.
+
+## Supported bundled languages

 The following bundled language identifiers are currently available:

-| Language | Enum constant | Notes |
-|---|---|---|
-| Danish | `DA_DK` | Bundled general-purpose dictionary |
-| German | `DE_DE` | Bundled general-purpose dictionary |
-| Spanish | `ES_ES` | Bundled general-purpose dictionary |
-| French | `FR_FR` | Bundled general-purpose dictionary |
-| Italian | `IT_IT` | Bundled general-purpose dictionary |
-| Dutch | `NL_NL` | Bundled general-purpose dictionary |
-| Norwegian | `NO_NO` | Bundled general-purpose dictionary |
-| Portuguese | `PT_PT` | Bundled general-purpose dictionary |
-| Russian | `RU_RU` | Currently supplied in normalized transliterated form |
-| Swedish | `SV_SE` | Bundled general-purpose dictionary |
-| English | `US_UK` | Standard English dictionary |
-| English | `US_UK_PROFI` | Extended English dictionary |
+| Language | Enum constant | Writing direction | Notes |
+|---|---|---:|---|
+| Czech | `CS_CZ` | LTR | Bundled general-purpose dictionary |
+| Danish | `DA_DK` | LTR | Bundled general-purpose dictionary |
+| German | `DE_DE` | LTR | Bundled general-purpose dictionary |
+| Spanish | `ES_ES` | LTR | Bundled general-purpose dictionary |
+| Persian | `FA_IR` | RTL | Bundled dictionary uses forward traversal over the stored form |
+| Finnish | `FI_FI` | LTR | Bundled general-purpose dictionary |
+| French | `FR_FR` | LTR | Bundled general-purpose dictionary |
+| Hebrew | `HE_IL` | RTL | Bundled dictionary uses forward traversal over the stored form |
+| Hungarian | `HU_HU` | LTR | Bundled general-purpose dictionary |
+| Italian | `IT_IT` | LTR | Bundled general-purpose dictionary |
+| Norwegian Bokmål | `NB_NO` | LTR | Bundled general-purpose dictionary |
+| Dutch | `NL_NL` | LTR | Bundled general-purpose dictionary |
+| Norwegian Nynorsk | `NN_NO` | LTR | Bundled general-purpose dictionary |
+| Polish | `PL_PL` | LTR | Bundled general-purpose dictionary |
+| Portuguese | `PT_PT` | LTR | Bundled general-purpose dictionary |
+| Russian | `RU_RU` | LTR | Bundled general-purpose dictionary |
+| Swedish | `SV_SE` | LTR | Bundled general-purpose dictionary |
+| Ukrainian | `UK_UA` | LTR | Bundled general-purpose dictionary |
+| English | `US_UK` | LTR | Bundled general-purpose dictionary |
+| Yiddish | `YI` | RTL | Bundled dictionary uses forward traversal over the stored form |

 ## Basic usage

-Load a bundled stemmer like this:
+Load a bundled dictionary like this:

 ```java
 import java.io.IOException;
@@ -52,16 +60,18 @@ public final class BuiltInExample {

     public static void main(final String[] arguments) throws IOException {
         final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
-                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                StemmerPatchTrieLoader.Language.US_UK,
                 true,
                 ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);
+
+        System.out.println(trie.traversalDirection());
     }
 }
 ```

-The loader reads the bundled dictionary resource, parses the textual entries, derives patch-command mappings, and compiles the result into a read-only trie.
+This call loads the bundled dictionary resource for the selected language, parses its lexical entries, derives patch-command mappings, and compiles the result into a read-only trie.

-## Example: stemming with `US_UK_PROFI`
+## Example: stemming with a bundled dictionary

 ```java
 import java.io.IOException;
@@ -79,44 +89,49 @@ public final class EnglishExample {

     public static void main(final String[] arguments) throws IOException {
         final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
-                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                StemmerPatchTrieLoader.Language.US_UK,
                 true,
                 ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

         final String word = "running";
         final String patch = trie.get(word);
-        final String stem = PatchCommandEncoder.apply(word, patch);
+        final String stem = PatchCommandEncoder.apply(word, patch, trie.traversalDirection());

         System.out.println(word + " -> " + stem);
     }
 }
 ```

-## `US_UK` and `US_UK_PROFI`
+Passing `trie.traversalDirection()` to `PatchCommandEncoder.apply(...)` is the correct general contract. It ensures that the patch is applied using the same logical traversal model that was used when the trie and its patch commands were produced.

-Radixor currently provides two bundled English variants.
+## Traversal behavior and right-to-left languages

-### `US_UK`
+Bundled dictionaries are not all processed identically.

-`US_UK` is the lighter-weight bundled English resource. It is suitable where a smaller default dictionary is preferred and maximal lexical coverage is not the primary goal.
+For traditional left-to-right suffix-oriented resources, Radixor preserves historical Egothor behavior and traverses logical word characters backward. That means trie paths are constructed from the logical end of the stored word toward its beginning, and patch commands are interpreted with the same backward traversal model.

-### `US_UK_PROFI`
+For bundled right-to-left languages such as Persian, Hebrew, and Yiddish, Radixor uses forward traversal over the stored form. In those cases:

-`US_UK_PROFI` is the more extensive bundled English resource. It offers broader lexical coverage and is the better default for most applications that want stronger out-of-the-box behavior.
+- trie keys are traversed from the logical beginning of the stored form,
+- patch commands are generated in that same forward direction,
+- patch application must use `WordTraversalDirection.FORWARD`, which is naturally obtained from `trie.traversalDirection()`.

-### Recommendation
+This design keeps the traversal policy explicit and consistent across dictionary loading, trie lookup, binary persistence, builder reconstruction, and patch application.

-For most English-language deployments, prefer:
+## Reduction behavior

-```text
-US_UK_PROFI
-```
+Bundled dictionaries can be compiled using any supported `ReductionMode`. The reduction configuration controls how semantically equivalent subtrees are merged during trie compilation, while preserving the contract of the selected mode.

-Use `US_UK` when a smaller bundled baseline is more appropriate.
+Typical entry points are:
+
+- `StemmerPatchTrieLoader.load(language, storeOriginal, reductionMode)`
+- `StemmerPatchTrieLoader.load(language, storeOriginal, reductionSettings)`
+
+For most users, `ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS` is the most conservative general-purpose choice because it preserves ranked `getAll(...)` behavior.

 ## Intended role of bundled dictionaries

-Bundled dictionaries should be understood as **general-purpose default resources**.
+Bundled dictionaries should be understood as practical default resources.

 They are a good fit when:

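The traversal rule described in the hunk above (backward for left-to-right resources, forward for right-to-left ones) can be pictured with a small standalone sketch. It is an illustration only: `TraversalSketch` and its `Direction` enum are hypothetical stand-ins, not Radixor's `WordTraversalDirection` or its trie code. Backward traversal reads a word from its logical end, so words that share a suffix share the beginning of their path; forward traversal reads from the logical beginning.

```java
public final class TraversalSketch {

    /** Hypothetical stand-in for the library's traversal-direction type. */
    public enum Direction { FORWARD, BACKWARD }

    private TraversalSketch() {
    }

    /** Returns the characters of a word in the order a trie path would consume them. */
    public static String traversalOrder(final String word, final Direction direction) {
        if (direction == Direction.FORWARD) {
            return word;
        }
        // Backward traversal walks from the logical end of the word to its beginning.
        return new StringBuilder(word).reverse().toString();
    }

    /** Counts how many leading path steps two words share under the given direction. */
    public static int sharedPathLength(final String first, final String second,
            final Direction direction) {
        final String a = traversalOrder(first, direction);
        final String b = traversalOrder(second, direction);
        final int limit = Math.min(a.length(), b.length());
        int shared = 0;
        while (shared < limit && a.charAt(shared) == b.charAt(shared)) {
            shared++;
        }
        return shared;
    }

    public static void main(final String[] arguments) {
        System.out.println(traversalOrder("running", Direction.BACKWARD));
        System.out.println(sharedPathLength("running", "walking", Direction.BACKWARD));
        System.out.println(sharedPathLength("running", "walking", Direction.FORWARD));
    }
}
```

Whichever direction was used to build a trie has to be used again when a patch is applied, which is why the updated example passes `trie.traversalDirection()` to `PatchCommandEncoder.apply(...)`.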
@@ -125,15 +140,18 @@ They are a good fit when:
 - a reasonable baseline is sufficient,
 - the goal is evaluation, prototyping, or straightforward integration.

-They are also well suited to staged refinement workflows in which the bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.
+They are also well suited to staged refinement workflows in which a bundled base is loaded first, then extended with domain-specific vocabulary, and finally persisted as a custom binary artifact.

 ## Character representation

-The current bundled resources follow a pragmatic normalization convention.
+Bundled dictionaries are ordinary UTF-8 lexical resources. The parser reads them as text, the trie stores standard Java strings, and the patch-command model operates on general character sequences.

-At present, bundled dictionaries are supplied in normalized plain-ASCII form. For some languages, this is simply a lightweight maintenance convention. For others, especially languages commonly written in another script, it reflects a transliterated lexical resource. Russian is the clearest example in the current bundled set.
+This is important for two reasons:

-This convention belongs to the supplied dictionary resources, not to the core stemming model. The parser reads UTF-8 text, the dictionary model works with ordinary Java strings, and the trie and patch-command mechanism operate on general character sequences. In practical terms, the architecture is compatible with native-script dictionaries when suitable lexical resources are available.
+1. the built-in resources are not limited to ASCII-only processing,
+2. the traversal model is orthogonal to character encoding and script choice.
+
+In other words, right-to-left handling in the loader is about logical traversal strategy, not about introducing a separate character model.

 ## When to prefer custom dictionaries

@@ -141,8 +159,8 @@ A custom dictionary is usually the better choice when:

 - domain-specific vocabulary materially affects stemming quality,
 - lexical coverage must be controlled more precisely,
-- a stronger language resource is available than the bundled baseline,
-- native-script support is needed beyond the currently bundled resources.
+- a stronger lexical resource is available than the bundled baseline,
+- operational requirements demand an explicitly curated, versioned artifact.

 Typical examples include:

@@ -150,7 +168,7 @@ Typical examples include:
 - biomedical language,
 - legal or financial vocabulary,
 - organization-specific product and process names,
-- language resources maintained in native scripts.
+- dictionaries maintained with project-specific validation rules.

 ## Production recommendation

@@ -158,11 +176,11 @@ For production systems, the most robust workflow is usually:

 1. start from a bundled dictionary when it is suitable,
 2. extend it with domain-specific forms if needed,
-3. compile or rebuild it into a binary `.radixor.gz` artifact,
-4. deploy that compiled artifact,
-5. load it at runtime using `loadBinary(...)`.
+3. rebuild it into a binary artifact,
+4. deploy that compiled binary artifact,
+5. load it at runtime through `loadBinary(...)`.

-This avoids repeated startup parsing and makes the deployed stemming behavior explicit and versionable.
+This avoids repeated startup parsing and makes the deployed stemming behavior explicit, reproducible, and versionable.

 ## Example refinement workflow

@@ -185,7 +203,7 @@ public final class BundledRefinementExample {

     public static void main(final String[] arguments) throws IOException {
         final FrequencyTrie<String> base = StemmerPatchTrieLoader.load(
-                StemmerPatchTrieLoader.Language.US_UK_PROFI,
+                StemmerPatchTrieLoader.Language.US_UK,
                 true,
                 ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS);

@@ -204,11 +222,27 @@ public final class BundledRefinementExample {
 }
 ```

+The reconstructed builder preserves the traversal direction of the source trie, so refinements remain semantically aligned with the original bundled dictionary.
+
 ## Extending language support

-The built-in set is intentionally a practical baseline rather than a closed catalog. High-quality dictionaries for additional languages, improved language coverage, and stronger native-script resources are all natural extension paths for the project.
+The built-in set is intentionally a practical baseline rather than a closed catalog. Additional languages, stronger lexical coverage, and improved dictionaries for currently supported languages are all natural extension paths.

-What matters most is not only the number of entries, but the quality, consistency, and operational usefulness of the lexical resource being added.
+What matters most is not only the number of entries, but the quality, consistency, maintainability, and operational usefulness of the lexical resource being added.
+
+## Related API surface
+
+The following types are typically involved when working with bundled dictionaries:
+
+- `StemmerPatchTrieLoader`
+- `StemmerPatchTrieLoader.Language`
+- `FrequencyTrie`
+- `PatchCommandEncoder`
+- `WordTraversalDirection`
+- `ReductionMode`
+- `ReductionSettings`
+- `StemmerPatchTrieBinaryIO`
+- `FrequencyTrieBuilders`

 ## Next steps

@@ -219,4 +253,4 @@ What matters most is not only the number of entries, but the quality, consistenc

 ## Summary

-Radixor’s built-in language support provides immediate usability, practical default dictionaries, and a strong starting point for custom refinement. The current bundled resources follow a pragmatic normalization convention, while the underlying architecture remains well suited to richer language resources and future extensions.
+Radixor’s built-in language support provides immediate usability, a well-defined baseline API, and a practical starting point for custom refinement. The bundled set now includes both left-to-right and right-to-left languages, and the library models that distinction explicitly through `WordTraversalDirection` so that trie construction, lookup, and patch application remain consistent.

@@ -9,7 +9,7 @@ This is the preferred preparation workflow when stemming should run against an a
 The `Compile` tool performs the following steps:

 1. reads the input dictionary in the standard Radixor stemmer format,
-2. parses each line into a canonical stem and its known variants,
+2. parses each line into a canonical stem column and its known variant columns,
 3. converts variants into patch commands,
 4. builds a mutable trie of patch-command values,
 5. applies the configured reduction mode,
@@ -21,7 +21,7 @@ This workflow is intentionally aligned with the same dictionary semantics used e

 ```bash
 java org.egothor.stemmer.Compile \
-  --input ./data/stemmer.txt \
+  --input ./data/stemmer.tsv \
   --output ./build/english.radixor.gz \
   --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
   --store-original \
@@ -47,12 +47,12 @@ The CLI supports the following arguments:

 Path to the source dictionary file.

-The file must use the standard line-oriented dictionary format. Each non-empty logical line starts with the canonical stem and may contain zero or more variants. The parser expects UTF-8 input, lowercases it using `Locale.ROOT`, and ignores trailing remarks introduced by `#` or `//`.
+The file must use the standard line-oriented tab-separated values dictionary format, meaning that columns are separated by the tab character. Each non-empty logical line starts with the canonical stem column and may contain zero or more variant columns. The parser expects UTF-8 input, lowercases it using `Locale.ROOT`, ignores trailing remarks introduced by `#` or `//`, and currently ignores dictionary items containing embedded whitespace while reporting them through warning-level log entries.

 Example:

 ```text
---input ./data/stemmer.txt
+--input ./data/stemmer.tsv
 ```

 ### `--output <file>`
@@ -190,15 +190,15 @@ Compilation is usually a one-time step and is generally fast. The more important
 ### 1. Prepare a dictionary

 ```text
-run running runs ran
-connect connected connecting
+run running runs ran
+connect connected connecting
 ```

 ### 2. Compile it

 ```bash
 java org.egothor.stemmer.Compile \
-  --input ./data/stemmer.txt \
+  --input ./data/stemmer.tsv \
   --output ./build/english.radixor.gz \
   --reduction-mode MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS \
   --store-original

@@ -22,27 +22,29 @@ In practice, dictionary quality matters more than dictionary size. A smaller but

 ## Preferred dictionary shape

-Radixor uses a simple line-oriented format:
+Radixor uses a simple line-oriented tab-separated values format, meaning that columns are separated by the tab character:

 ```text
-<stem> <variant1> <variant2> <variant3> ...
+<stem> <variant1> <variant2> <variant3> ...
 ```

-The first token on a line is the canonical stem. All following tokens on that line are known variants that should reduce to that stem.
+The first column on a line is the canonical stem. All following tab-separated columns on that line are known variants that should reduce to that stem.

 Example:

 ```text
-run running runs ran
-connect connected connecting connection
+run running runs ran
+connect connected connecting connection
 ```

 The parser:

 - reads UTF-8 text,
+- interprets each line as tab-separated values,
 - normalizes input to lower case using `Locale.ROOT`,
 - ignores empty lines,
-- supports remarks introduced by `#` or `//`.
+- supports remarks introduced by `#` or `//`,
+- currently ignores dictionary items containing embedded whitespace and reports them through warning-level log entries.

 For full format details, see [Dictionary format](dictionary-format.md).

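The parser rules listed in the hunk above can be approximated in a few lines of plain Java. The sketch below is not `StemmerDictionaryParser`: `DictionaryLineSketch` and `parseLine` are hypothetical names, warnings go to `System.err` rather than a logger, and dropping the whole line when the stem itself contains embedded whitespace is an assumption that the documentation does not confirm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public final class DictionaryLineSketch {

    private DictionaryLineSketch() {
    }

    /** Returns the usable items of one physical line: the stem first, then its variants. */
    public static List<String> parseLine(final String rawLine) {
        // A remark starts at the earliest '#' or "//" and runs to the end of the line.
        int cut = rawLine.length();
        final int hash = rawLine.indexOf('#');
        final int slashes = rawLine.indexOf("//");
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }

        // Lowercase with Locale.ROOT so the result does not depend on the default locale.
        final String content = rawLine.substring(0, cut).toLowerCase(Locale.ROOT);

        final List<String> items = new ArrayList<>();
        for (final String column : content.split("\t")) {
            // Padding around a column is removed before the item is inspected.
            final String item = column.trim();
            if (item.isEmpty()) {
                continue;
            }
            if (item.chars().anyMatch(Character::isWhitespace)) {
                System.err.println("WARNING: ignoring item with embedded whitespace: " + item);
                if (items.isEmpty()) {
                    // Assumption: without a usable stem the rest of the line is dropped too.
                    return new ArrayList<>();
                }
                continue;
            }
            items.add(item);
        }
        return items;
    }

    public static void main(final String[] arguments) {
        System.out.println(parseLine("Run\tRunning\tRuns\tRan # English verb forms"));
        System.out.println(parseLine("connect\tnew york\tconnected // derived forms"));
    }
}
```

Splitting on the tab character only, and trimming afterwards, is what makes an item such as `new york` detectable as a single column with embedded whitespace instead of silently becoming two items.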
@@ -1,6 +1,6 @@
 # Dictionary Format

-Radixor uses a simple line-oriented dictionary format designed for practical stemming workflows.
+Radixor uses a simple line-oriented dictionary format designed for practical stemming workflows. The textual source format is tab-separated values, meaning that columns are separated by the tab character.

 Each logical line describes one canonical stem and zero or more known word variants that should reduce to that stem. The format is intentionally lightweight, easy to maintain in source control, and directly consumable both by the programmatic loader and by the CLI compiler.

@@ -9,16 +9,16 @@ Each logical line describes one canonical stem and zero or more known word varia
 Each non-empty logical line has the following shape:

 ```text
-<stem> <variant1> <variant2> <variant3> ...
+<stem> <variant1> <variant2> <variant3> ...
 ```

-The first token is interpreted as the **canonical stem**. Every following token on the same line is interpreted as a **known variant** belonging to that stem.
+The first column is interpreted as the **canonical stem**. Every following column on the same line is interpreted as a **known variant** belonging to that stem.

 Example:

 ```text
-run running runs ran
-connect connected connecting connection
+run running runs ran
+connect connected connecting connection
 ```

 In this example:
@@ -30,7 +30,7 @@ In this example:

 When a dictionary is loaded through `StemmerPatchTrieLoader`, the loader processes each parsed line as follows:

-1. the first token becomes the canonical stem,
+1. the first column becomes the canonical stem,
 2. every following token is treated as a variant,
 3. each variant is converted into a patch command that transforms the variant into the stem,
 4. if `storeOriginal` is enabled, the stem itself is also inserted using the canonical no-op patch command.
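The four loader steps above can be traced with a toy example. Radixor's real patch-command encoding lives in `PatchCommandEncoder` and is not described in this diff, so the sketch substitutes a deliberately simplified, hypothetical suffix patch of the form `<charsToDelete>:<textToAppend>`. `SuffixPatchSketch` and all of its methods are illustrative names, not library API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class SuffixPatchSketch {

    private SuffixPatchSketch() {
    }

    /** Derives a simplified patch that turns the variant into the stem. */
    public static String derive(final String variant, final String stem) {
        final int limit = Math.min(variant.length(), stem.length());
        int shared = 0;
        while (shared < limit && variant.charAt(shared) == stem.charAt(shared)) {
            shared++;
        }
        // Delete everything after the shared prefix, then append the rest of the stem.
        return (variant.length() - shared) + ":" + stem.substring(shared);
    }

    /** Applies a simplified patch; assumes the word is at least as long as the delete count. */
    public static String apply(final String word, final String patch) {
        final int separator = patch.indexOf(':');
        final int deleteCount = Integer.parseInt(patch.substring(0, separator));
        return word.substring(0, word.length() - deleteCount) + patch.substring(separator + 1);
    }

    /** Maps each surface form of one parsed line to its patch, stem first. */
    public static Map<String, String> mappings(final List<String> items, final boolean storeOriginal) {
        final Map<String, String> result = new LinkedHashMap<>();
        if (items.isEmpty()) {
            return result;
        }
        final String stem = items.get(0);
        if (storeOriginal) {
            // The stem maps to itself through a no-op patch ("0:" in this toy encoding).
            result.put(stem, derive(stem, stem));
        }
        for (final String variant : items.subList(1, items.size())) {
            result.put(variant, derive(variant, stem));
        }
        return result;
    }

    public static void main(final String[] arguments) {
        final Map<String, String> patches = mappings(List.of("run", "running", "runs", "ran"), true);
        patches.forEach((word, patch) ->
                System.out.println(word + " + " + patch + " -> " + apply(word, patch)));
    }
}
```

The point of storing patches rather than stems is that many different words can share one patch value, which is what gives subtree reduction something to merge.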
@@ -52,21 +52,23 @@ Whether such a line is operationally useful depends on how the dictionary is loa
 - if `storeOriginal` is enabled, the stem itself is inserted as a no-op mapping,
 - if `storeOriginal` is disabled, the line contributes no explicit variant mappings.

-## Whitespace rules
+## Column and whitespace rules

-Tokens are separated by whitespace. Leading and trailing whitespace is ignored.
+Columns are separated by the tab character. Leading and trailing whitespace around each column is ignored.

-These lines are equivalent:
+This is the canonical form:

 ```text
-run running runs ran
+run running runs ran
 ```

+This is also accepted because the surrounding padding is removed before the item is processed:
+
 ```text
-run running runs ran
+run running runs ran
 ```

-Tabs and repeated spaces are both accepted because tokenization is whitespace-based.
+Embedded whitespace inside one dictionary item is currently not supported. A stem or variant such as `new york` therefore cannot yet be represented as one usable dictionary item in the textual source format. Such items are ignored during parsing and reported through a warning-level log entry together with the physical line number, the stem, and the ignored items from that line.

 ## Empty lines

@@ -75,9 +77,9 @@ Empty lines are ignored.
 Example:

 ```text
-run running runs ran
+run running runs ran

-connect connected connecting
+connect connected connecting
 ```

 The blank line between entries has no effect.
@@ -96,8 +98,8 @@ The earliest occurrence of either marker terminates the logical content of the l
 Examples:

 ```text
-run running runs ran # English verb forms
-connect connected connecting // Common derived forms
+run running runs ran # English verb forms
+connect connected connecting // Common derived forms
 ```

 This is also valid:
@@ -109,20 +111,20 @@ This is also valid:

 ## Case normalization

-Input lines are normalized to lower case using `Locale.ROOT` before tokenization is processed into dictionary entries.
+Input lines are normalized to lower case using `Locale.ROOT` before tab-separated columns are processed into dictionary entries.

 That means dictionary authors should treat the format as **case-insensitive at load time**. If a file contains uppercase or mixed-case tokens, they will be normalized during parsing.

 Example:

 ```text
-Run Running Runs Ran
+Run Running Runs Ran
 ```

 is processed the same way as:

 ```text
-run running runs ran
+run running runs ran
 ```

 ## Character set and practical convention
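Why the case-normalization rule above names `Locale.ROOT` explicitly can be shown with the JDK alone. The sketch uses only `String.toLowerCase(Locale)`; the class name `LocaleRootSketch` is illustrative. Under a Turkish locale the capital `I` lowercases to the dotless `ı` (U+0131), so a locale-sensitive parser could produce different dictionary keys on different machines.

```java
import java.util.Locale;

public final class LocaleRootSketch {

    private LocaleRootSketch() {
    }

    /** Lowercases a dictionary line the locale-independent way. */
    public static String normalize(final String line) {
        return line.toLowerCase(Locale.ROOT);
    }

    public static void main(final String[] arguments) {
        final String line = "Run\tRunning\tRuns\tRan";
        System.out.println(normalize(line));

        // Locale-sensitive lowercasing changes the result for some letters.
        final Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("I".toLowerCase(Locale.ROOT).equals("I".toLowerCase(turkish)));
    }
}
```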
@@ -142,8 +144,8 @@ The format expresses a one-line grouping of forms under a canonical stem. It doe
 For example:

 ```text
-axis axes
-axe axes
+axis axes
+axe axes
 ```

 These are simply two independent lines. If both contribute mappings for the same surface form, the compiled trie may later expose one or more candidate patch commands depending on the accumulated local counts and the selected reduction mode.
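The effect of accumulated local counts on the `axes` example can be sketched without the library. This is a simplified model, not `FrequencyTrie`: it counts candidate stems instead of patch commands, ignores reduction entirely, and breaks ties alphabetically, which is an assumption made only to keep the output deterministic. `CandidateCountSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class CandidateCountSketch {

    /** Surface form -> (candidate stem -> number of times the mapping was seen). */
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one occurrence of a variant-to-stem mapping. */
    public void add(final String surfaceForm, final String stem) {
        counts.computeIfAbsent(surfaceForm, key -> new HashMap<>()).merge(stem, 1, Integer::sum);
    }

    /** Returns the candidate stems for a surface form, most frequent first. */
    public List<String> ranked(final String surfaceForm) {
        final Map<String, Integer> local = counts.getOrDefault(surfaceForm, Collections.emptyMap());
        final List<String> result = new ArrayList<>(local.keySet());
        result.sort((left, right) -> {
            final int byCount = Integer.compare(local.get(right), local.get(left));
            // Assumption: equal counts fall back to alphabetical order.
            return byCount != 0 ? byCount : left.compareTo(right);
        });
        return result;
    }

    public static void main(final String[] arguments) {
        final CandidateCountSketch sketch = new CandidateCountSketch();
        sketch.add("axes", "axis");
        sketch.add("axes", "axe");
        sketch.add("axes", "axis"); // a repeated dictionary line raises the count
        System.out.println(sketch.ranked("axes"));
    }
}
```

In this model a repeated line changes which candidate is ranked first, which mirrors the documentation's point that repeating a mapping is more than redundant text.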
@@ -163,32 +165,32 @@ As a result, repeating the same mapping is not just redundant text. It can influ
 ### Simple English example

 ```text
-run running runs ran
-connect connected connecting connection
-build building builds built
+run running runs ran
+connect connected connecting connection
+build building builds built
 ```

 ### Dictionary with remarks

 ```text
-run running runs ran # canonical verb family
-connect connected connecting // derived forms
-build building builds built
+run running runs ran # canonical verb family
+connect connected connecting // derived forms
+build building builds built
 ```

 ### Stem-only entries

 ```text
 run
-connect connected connecting
+connect connected connecting
 build
 ```

 ### Mixed case input

 ```text
-Run Running Runs Ran
-CONNECT Connected Connecting
+Run Running Runs Ran
+CONNECT Connected Connecting
 ```

 This is accepted, but it is normalized to lower case during parsing.
@@ -204,7 +206,7 @@ The current dictionary format intentionally stays minimal:
 - no explicit ambiguity syntax,
 - no sectioning or nested structure.

-Each token is simply a whitespace-delimited word form after remark stripping and lowercasing.
+Each dictionary item is simply one tab-separated word form after remark stripping and lowercasing.

 ## Authoring guidance

@@ -87,3 +87,20 @@ This model works especially well when domain-specific extensions are added in la

 - [Loading and Building Stemmers](programmatic-loading-and-building.md)
 - [Querying and Ambiguity Handling](programmatic-querying-and-ambiguity.md)
+
+
+## Inspecting persisted metadata
+
+After loading a compiled artifact, applications can inspect the persisted build descriptor directly:
+
+```java
+final FrequencyTrie<String> trie = StemmerPatchTrieLoader.loadBinary("build/stemmers/cs_cz.dat.gz");
+final TrieMetadata metadata = trie.metadata();
+
+System.out.println(metadata.formatVersion());
+System.out.println(metadata.traversalDirection());
+System.out.println(metadata.reductionSettings().reductionMode());
+System.out.println(metadata.diacriticProcessingMode());
+```
+
+This is especially useful when a deployment manages multiple artifacts compiled under different traversal or reduction regimes.

@@ -32,7 +32,7 @@ The `storeOriginal` flag controls whether the canonical stem is inserted as a no

 ## Load a textual dictionary

-Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. Each non-empty logical line starts with the stem and may contain zero or more variants. Input is normalized to lower case using `Locale.ROOT`, and trailing remarks introduced by `#` or `//` are ignored.
+Loading from a dictionary file follows the same preparation model as bundled resources, but the source comes from your own file or path. The textual format is tab-separated values, meaning that columns are separated by the tab character. Each non-empty logical line starts with the stem column and may contain zero or more variant columns. Input is normalized to lower case using `Locale.ROOT`, trailing remarks introduced by `#` or `//` are ignored, and dictionary items containing embedded whitespace are currently ignored with warning-level diagnostics.

 ```java
 import java.io.IOException;
@@ -51,7 +51,7 @@ public final class LoadTextDictionaryExample {

     public static void main(final String[] arguments) throws IOException {
         final FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
-                Path.of("data", "stemmer.txt"),
+                Path.of("data", "stemmer.tsv"),
                 true,
                 ReductionSettings.withDefaults(
                         ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));

@@ -15,7 +15,7 @@ A compiled stemmer can be obtained in three common ways.

 ### Use a bundled language dictionary

-Radixor ships with bundled dictionaries for a set of supported languages. These resources are line-oriented dictionaries stored with the library and compiled into a `FrequencyTrie<String>` when loaded. The loader can also store the canonical stem itself as a no-op patch command.
+Radixor ships with bundled dictionaries for a set of supported languages. These resources are line-oriented dictionaries stored with the library and compiled into a `FrequencyTrie<String>` when loaded. The loader can also store the canonical stem itself as a no-op patch command. Compiled trie artifacts now persist self-describing metadata, including the traversal direction and compilation reduction settings used to build the artifact.

 ```java
 import java.io.IOException;
@@ -202,3 +202,8 @@ Dictionary compilation is usually a one-time preparation step and is generally f
 - [CLI compilation](cli-compilation.md)
 - [Built-in languages](built-in-languages.md)
 - [Architecture and reduction](architecture-and-reduction.md)
+
+
+## Persisted trie metadata
+
+Every compiled trie artifact stores a `TrieMetadata` descriptor together with the immutable trie payload. That metadata currently records the binary format version, the `WordTraversalDirection`, the `ReductionSettings` used during compilation, and the declared `DiacriticProcessingMode`. Even when a given release does not yet actively branch on every field at query time, persisting the full descriptor keeps artifacts self-describing and prepares the format for future matching strategies without relying on side-channel configuration.