docs: improve README, MkDocs content, branding assets, and site polish

2026-04-19 00:18:42 +02:00
parent db79dd2d4f
commit 0b674a39a8
19 changed files with 1836 additions and 1698 deletions


@@ -1,320 +1,56 @@
# Programmatic Usage
This document describes how to use **Radixor** programmatically from Java.
This document provides the programmatic entry point to **Radixor**.
It covers:
Radixor follows a clear lifecycle:
- building a trie from dictionary data
- compiling it into an immutable structure
- loading compiled stemmers
- querying for stems
- working with multiple candidates
- modifying existing compiled stemmers
1. acquire a compiled stemmer,
2. query it for patch commands,
3. apply those commands to produce stems,
4. reopen and extend the compiled structure when needed.
## Conceptual model
Radixor is dictionary-driven, but runtime stemming does not operate by scanning raw dictionary files. A source dictionary is parsed as a sequence of canonical stems and their known variants. Each variant is converted into a compact patch command that transforms the variant into the stem, while the stem itself may optionally be stored as a canonical no-op patch. The mutable trie is then reduced into a compiled read-only structure that stores ordered values and their counts at addressed nodes.
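To make the idea of a patch command concrete, here is a minimal sketch of a toy suffix patch. It is an illustration only: the `SuffixPatchSketch` class and its `N:suffix` text format are assumptions made for this example and do not reflect the encoding that `PatchCommandEncoder` actually produces.

```java
/**
 * Illustrative sketch only. The "N:suffix" format used here is a hypothetical
 * stand-in and is NOT Radixor's real patch-command encoding.
 */
final class SuffixPatchSketch {

    /** Encodes "drop N trailing characters, then append a suffix" as "N:suffix". */
    static String encode(String variant, String stem) {
        // Length of the prefix shared by the variant and the stem.
        int common = 0;
        int max = Math.min(variant.length(), stem.length());
        while (common < max && variant.charAt(common) == stem.charAt(common)) {
            common++;
        }
        int toDrop = variant.length() - common;
        return toDrop + ":" + stem.substring(common);
    }

    /** Applies a toy patch produced by encode(...) to a word. */
    static String apply(String word, String patch) {
        int sep = patch.indexOf(':');
        int toDrop = Integer.parseInt(patch.substring(0, sep));
        if (toDrop > word.length()) {
            return word; // the patch does not fit, so leave the word unchanged
        }
        return word.substring(0, word.length() - toDrop) + patch.substring(sep + 1);
    }
}
```

In this toy format, `encode("walking", "walk")` yields `3:`, and applying that same patch to `talking` yields `talk`, even though `talking` was never encoded. A stem encoded against itself, such as `encode("run", "run")`, yields `0:`, which plays the role of a no-op patch.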
## Overview
Two consequences matter for developers:
Radixor separates the stemming lifecycle into three stages:
- the quality and coverage of stemming behavior depend on dictionary richness,
- runtime usage is based on compiled patch-command lookup rather than on direct dictionary traversal.
1. **Build**: collect word-to-stem mappings in a mutable structure
2. **Compile**: reduce and convert to an immutable trie
3. **Query**: perform fast runtime lookups
This is why Radixor can generalize beyond explicitly listed forms and why compiled artifacts are well suited for deployment.
These stages are represented by:
## Documentation map
- `FrequencyTrie.Builder` (mutable)
- `FrequencyTrie` (immutable, compiled)
- `StemmerPatchTrieLoader` / `StemmerPatchTrieBinaryIO` (I/O)
The programmatic API is easier to understand when split by developer task:
- [Loading and Building Stemmers](programmatic-loading-and-building.md) explains how to acquire a compiled stemmer from bundled resources, textual dictionaries, binary artifacts, or direct builder usage.
- [Querying and Ambiguity Handling](programmatic-querying-and-ambiguity.md) explains `get(...)`, `getAll(...)`, `getEntries(...)`, patch application, and the practical meaning of reduction modes.
- [Extending and Persisting Compiled Tries](programmatic-extending-and-persistence.md) explains how to reopen compiled tries, add new lexical data, rebuild them, and store them as binary artifacts.
## Core types
## Building a trie programmatically
The main types involved in programmatic usage are:
You can construct a trie directly without using the CLI.
- `FrequencyTrie.Builder<V>` for mutable construction and extension,
- `FrequencyTrie<V>` for the compiled read-only trie,
- `PatchCommandEncoder` for creating and applying patch commands,
- `StemmerPatchTrieLoader` for loading bundled or textual dictionaries,
- `StemmerPatchTrieBinaryIO` for reading and writing compressed binary artifacts,
- `FrequencyTrieBuilders` for reconstructing a mutable builder from a compiled trie,
- `ReductionMode` and `ReductionSettings` for controlling compilation semantics.
```java
import org.egothor.stemmer.*;

public final class BuildExample {
    public static void main(String[] args) {
        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        FrequencyTrie.Builder<String> builder =
                new FrequencyTrie.Builder<>(String[]::new, settings);

        PatchCommandEncoder encoder = new PatchCommandEncoder();

        builder.put("running", encoder.encode("running", "run"));
        builder.put("runs", encoder.encode("runs", "run"));
        builder.put("ran", encoder.encode("ran", "run"));

        FrequencyTrie<String> trie = builder.build();
    }
}
```
## Loading from dictionary files
To parse dictionary files directly:
```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class LoadFromDictionaryExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"),
                true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
                )
        );
    }
}
```
## Loading a compiled binary trie
```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class LoadBinaryExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> trie =
                StemmerPatchTrieLoader.loadBinary(Path.of("english.radixor.gz"));
    }
}
```
This is the **preferred production approach**.
## Querying for stems
### Preferred result
```java
String word = "running";
String patch = trie.get(word);
String stem = PatchCommandEncoder.apply(word, patch);
```
### All candidates
```java
String[] patches = trie.getAll(word);

for (String patch : patches) {
    String stem = PatchCommandEncoder.apply(word, patch);
}
```
## Accessing value frequencies
For diagnostic or advanced use cases:
```java
import org.egothor.stemmer.ValueCount;
java.util.List<ValueCount<String>> entries = trie.getEntries("axes");
for (ValueCount<String> entry : entries) {
    String patch = entry.value();
    int count = entry.count();
}
```
This allows:
* inspecting ambiguity
* understanding ranking decisions
* debugging dictionary quality
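The following sketch shows one way counted, ranked candidates could be derived from raw observations. It is self-contained and does not use Radixor: `RankedCountsSketch` and its nested `Entry` record are hypothetical stand-ins for the idea behind `ValueCount`, and the tie-breaking rule (first-seen order) is an assumption of this example, not a documented property of the library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not part of the Radixor API. */
final class RankedCountsSketch {

    /** Hypothetical stand-in for a value paired with its observation count. */
    record Entry(String value, int count) {}

    /** Counts each observed value and returns entries ordered by descending count. */
    static List<Entry> rank(List<String> observed) {
        // LinkedHashMap keeps first-seen order, which the stable sort below
        // preserves for equal counts (an assumption made for this sketch).
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : observed) {
            counts.merge(value, 1, Integer::sum);
        }

        List<Entry> entries = new ArrayList<>();
        counts.forEach((value, count) -> entries.add(new Entry(value, count)));
        entries.sort(Comparator.comparingInt(Entry::count).reversed());
        return entries;
    }
}
```

Calling `RankedCountsSketch.rank(List.of("patch-A", "patch-B", "patch-B"))` places `patch-B` (count 2) ahead of `patch-A` (count 1). A list like this makes it visible when a key carries several competing values and how strongly each one is supported.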
## Using bundled language resources
```java
FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
StemmerPatchTrieLoader.Language.US_UK_PROFI,
true,
ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
);
```
Bundled dictionaries are useful for:
* quick integration
* testing
* reference behavior
## Persisting a compiled trie
```java
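import java.io.IOException;
import java.nio.file.Path;
import org.egothor.stemmer.*;

public final class SaveExample {
    public static void main(String[] args) throws IOException {
        // Obtain a compiled trie first; here it is compiled from a textual dictionary.
        FrequencyTrie<String> trie = StemmerPatchTrieLoader.load(
                Path.of("data/stemmer.txt"),
                true,
                ReductionSettings.withDefaults(
                        ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS));

        StemmerPatchTrieBinaryIO.write(trie, Path.of("english.radixor.gz"));
    }
}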
```
## Modifying an existing trie
A compiled trie can be reopened into a builder, extended, and rebuilt.
```java
import java.io.IOException;
import java.nio.file.Path;

import org.egothor.stemmer.*;

public final class ModifyExample {
    public static void main(String[] args) throws IOException {
        FrequencyTrie<String> compiled =
                StemmerPatchTrieBinaryIO.read(Path.of("english.radixor.gz"));

        ReductionSettings settings = ReductionSettings.withDefaults(
                ReductionMode.MERGE_SUBTREES_WITH_EQUIVALENT_RANKED_GET_ALL_RESULTS
        );

        FrequencyTrie.Builder<String> builder =
                FrequencyTrieBuilders.copyOf(compiled, String[]::new, settings);

        builder.put("microservices", PatchCommandEncoder.NOOP_PATCH);

        FrequencyTrie<String> updated = builder.build();

        StemmerPatchTrieBinaryIO.write(updated,
                Path.of("english-custom.radixor.gz"));
    }
}
```
## Thread safety
* `FrequencyTrie` (compiled):
  * **thread-safe**
  * safe for concurrent reads
* `FrequencyTrie.Builder`:
  * **not thread-safe**
  * intended for single-threaded construction
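The "load once, share across threads" pattern implied by these guarantees can be sketched without Radixor itself. In the example below an immutable `Map` stands in for a compiled trie; the `SharedReadOnlySketch` class, the pool size, and the fall-back-to-the-word behaviour are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative sketch only; an immutable Map stands in for a compiled trie. */
final class SharedReadOnlySketch {

    /** Looks up every word on a worker thread, sharing one read-only table. */
    static List<String> lookupConcurrently(Map<String, String> table, List<String> words) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // All tasks read the same instance; none of them mutates it.
            List<Future<String>> futures = new ArrayList<>();
            for (String word : words) {
                futures.add(pool.submit(() -> table.getOrDefault(word, word)));
            }
            // Collect results in submission order.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for lookups", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With `Map.of("running", "run")` as the shared table, looking up `running` and `cat` returns `run` and `cat`. The point of the design is that the shared structure is built once on a single thread and is only ever read afterwards, so no locking is needed on the query path.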
## Performance characteristics
### Querying
* O(length of word)
* minimal allocations
* suitable for high-throughput pipelines
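The O(length of word) bound follows from how a trie is traversed: each character of the key selects at most one child node. The sketch below is a plain character trie written for illustration; `CharTrieSketch` is a hypothetical class, and Radixor's compiled structure may differ in traversal direction, node layout, and fallback behaviour.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; not Radixor's compiled trie layout. */
final class CharTrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value; // null when no value is stored at this node
    }

    private final Node root = new Node();

    /** Stores a value under the key, creating one node per missing character. */
    void put(String key, String value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    /** Walks one node per character, so the work is proportional to key length. */
    String get(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.get(key.charAt(i));
            if (node == null) {
                return null; // no path for this key
            }
        }
        return node.value;
    }
}
```

The lookup cost depends on the length of the queried word and not on how many entries the trie holds, which is the property the bullet list above relies on.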
### Loading
* binary loading is fast
* no preprocessing required
### Building
* depends on dictionary size
* reduction phase may be CPU-intensive
## Best practices
### Reuse compiled trie instances
* load once
* share across threads
### Prefer binary loading in production
* avoid rebuilding at runtime
* treat compiled files as deployable artifacts
### Use `getAll()` only when needed
* `get()` is faster and sufficient for most use cases
### Keep builders short-lived
* build → compile → discard
## Integration patterns
### Search systems
* apply stemming during indexing and querying
* ensure consistent dictionary usage
### Text normalization pipelines
* integrate as a transformation step
* combine with tokenization and filtering
### Domain adaptation
* extend dictionaries with domain-specific vocabulary
* rebuild compiled artifacts
## Recommended reading order
For most developers, the best order is:
1. [Loading and Building Stemmers](programmatic-loading-and-building.md)
2. [Querying and Ambiguity Handling](programmatic-querying-and-ambiguity.md)
3. [Extending and Persisting Compiled Tries](programmatic-extending-and-persistence.md)
## Next steps
* [Dictionary format](dictionary-format.md)
* [CLI compilation](cli-compilation.md)
* [Architecture and reduction](architecture-and-reduction.md)
## Summary
Programmatic usage of Radixor follows a clear pattern:
* build or load a trie
* query using patch commands
* apply transformations
The API is intentionally simple at the surface, while providing deeper control when needed for:
* ambiguity handling
* diagnostics
* dictionary evolution
- [Quick Start](quick-start.md)
- [CLI compilation](cli-compilation.md)
- [Dictionary format](dictionary-format.md)
- [Architecture and reduction](architecture-and-reduction.md)