2026-04-13 02:10:46 +02:00

Quality and Operations

This document describes quality, testing, and operational practices for Radixor.

It focuses on:

reliability and determinism
testing strategies
deployment patterns
performance considerations
lifecycle management of stemmer data

Overview

Radixor is designed to separate:

data preparation (dictionary construction and compilation)
runtime execution (lookup and patch application)

This separation enables:

predictable runtime behavior
reproducible builds
controlled evolution of stemming data

Determinism and reproducibility

Radixor emphasizes deterministic behavior.

Deterministic outputs

Given:

the same dictionary input
the same reduction settings

Radixor guarantees:

identical compiled trie structure
identical value ordering
identical lookup results
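
One way to check this guarantee in a build pipeline is to compile the same dictionary twice and compare digests of the two outputs. The sketch below is a minimal illustration, not part of Radixor: `ArtifactDigest` and its methods are hypothetical names, and it assumes that an identical trie structure serializes to identical bytes.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not a Radixor class): compares build outputs by SHA-256. */
final class ArtifactDigest {

    private ArtifactDigest() {
    }

    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** True when two builds produced byte-identical output. */
    static boolean sameArtifact(byte[] firstBuild, byte[] secondBuild) {
        return sha256Hex(firstBuild).equals(sha256Hex(secondBuild));
    }
}
```

If two compressed files differ although the inputs were the same, comparing the decompressed payloads can show whether the difference comes from the compression container rather than from the trie itself.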

Why this matters

stable search behavior across deployments
reproducible builds
easier debugging and regression analysis

Testing strategy

Unit testing

Core components should be tested independently:

patch encoding and decoding
trie construction
reduction behavior
binary serialization and deserialization

Dictionary validation tests

A recommended pattern:

load dictionary input
compile trie
re-apply all word → stem mappings
verify that:

expected stem is present in getAll()
preferred result (get()) equals the expected stem when the word maps to a single stem

This ensures:

no data loss during reduction
correctness of patch encoding
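
The pattern above can be sketched as a small check that is independent of the trie API. `DictionaryValidation` and the `candidateStems` adapter are hypothetical: the adapter is assumed to wrap the compiled trie and return every stem produced for a word, following getAll() semantics with patches already applied.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical validation helper; the trie is reached only through an adapter. */
final class DictionaryValidation {

    private DictionaryValidation() {
    }

    /**
     * Returns the words whose expected stem is missing from the candidates.
     * An empty result means every word-to-stem mapping survived compilation.
     */
    static List<String> findMissing(Map<String, String> wordToStem,
            Function<String, Collection<String>> candidateStems) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, String> entry : wordToStem.entrySet()) {
            Collection<String> candidates = candidateStems.apply(entry.getKey());
            if (candidates == null || !candidates.contains(entry.getValue())) {
                failures.add(entry.getKey());
            }
        }
        return failures;
    }
}
```

In a JUnit test the returned list would typically be asserted to be empty, so that a failure message names the affected words.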

Regression testing

Maintain a stable test dataset:

representative vocabulary
edge cases (short words, long words, ambiguous forms)

Use it to:

detect unintended changes
verify behavior after refactoring
validate reduction mode changes
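
A regression check of this kind can be as simple as comparing a stored baseline of word-to-stem results with the results of the current build. The sketch below assumes both sides have already been collected into maps; `RegressionDiff` is a hypothetical name and nothing in it depends on the Radixor API.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical helper that reports stemming results which differ from a baseline. */
final class RegressionDiff {

    private RegressionDiff() {
    }

    /**
     * Returns, per baseline word, a "before -> after" description for every
     * result that changed. Words missing from the current results show "null".
     */
    static Map<String, String> changed(Map<String, String> baseline, Map<String, String> current) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> entry : baseline.entrySet()) {
            String now = current.get(entry.getKey());
            if (!Objects.equals(entry.getValue(), now)) {
                changes.put(entry.getKey(), entry.getValue() + " -> " + now);
            }
        }
        return changes;
    }
}
```

The sorted map keeps the report stable between runs, which makes it easy to review when a change is intentional.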

Performance testing

Performance should be evaluated in terms of:

Throughput

words processed per second

Latency

time per lookup

Memory footprint

size of compiled trie
runtime memory usage

Benchmark with:

realistic token streams
production-like dictionaries
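
As a rough illustration of the throughput and latency measurements, the sketch below times repeated lookups over a token list. `LookupBenchmark` is hypothetical, and the lookup is passed in as a plain function, so the actual trie call is an assumption left to the caller. For results that need to be trusted, a dedicated harness such as JMH is the better tool.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical micro-benchmark sketch; the lookup is supplied by the caller. */
final class LookupBenchmark {

    // Written after every run so the JIT cannot treat the lookups as dead code.
    static volatile int sink;

    private LookupBenchmark() {
    }

    /** Returns the average time per lookup in nanoseconds over the measured rounds. */
    static double averageNanosPerLookup(List<String> tokens, Function<String, String> lookup,
            int warmupRounds, int measuredRounds) {
        if (tokens.isEmpty() || measuredRounds <= 0) {
            throw new IllegalArgumentException("need tokens and at least one measured round");
        }
        int checksum = 0;
        // Warm-up rounds give the JIT a chance to compile the hot path first.
        for (int round = 0; round < warmupRounds; round++) {
            for (String token : tokens) {
                checksum += Objects.hashCode(lookup.apply(token));
            }
        }
        long start = System.nanoTime();
        for (int round = 0; round < measuredRounds; round++) {
            for (String token : tokens) {
                checksum += Objects.hashCode(lookup.apply(token));
            }
        }
        long elapsed = System.nanoTime() - start;
        sink = checksum;
        return (double) elapsed / ((long) measuredRounds * tokens.size());
    }

    /** Converts an average latency into words processed per second. */
    static double wordsPerSecond(double nanosPerLookup) {
        return 1_000_000_000.0 / nanosPerLookup;
    }
}
```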

Deployment model

Recommended workflow

prepare dictionary data
compile using CLI
store .radixor.gz artifact
deploy artifact with application
load using loadBinary(...)

Why this model

avoids runtime compilation overhead
reduces startup latency
ensures consistent behavior across environments

Artifact management

Compiled stemmers should be treated as versioned assets.

Versioning

include version in filename or metadata
track dictionary source and reduction settings

Example:

english-v1.2-ranked.radixor.gz
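
If the filename carries the version, tooling can read it back when selecting or auditing artifacts. The sketch below assumes the `language-vVERSION-mode.radixor.gz` layout of the example above; that layout and the `ArtifactName` class are illustrative assumptions, not a format defined by Radixor.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for artifact names shaped like the example above. */
final class ArtifactName {

    // language, dotted numeric version, reduction mode, fixed suffix
    private static final Pattern PATTERN =
            Pattern.compile("([a-z]+)-v(\\d+(?:\\.\\d+)*)-([a-z]+)\\.radixor\\.gz");

    final String language;
    final String version;
    final String reductionMode;

    private ArtifactName(String language, String version, String reductionMode) {
        this.language = language;
        this.version = version;
        this.reductionMode = reductionMode;
    }

    /** Returns the parsed name, or empty when the file name does not follow the layout. */
    static Optional<ArtifactName> parse(String fileName) {
        Matcher matcher = PATTERN.matcher(fileName);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ArtifactName(matcher.group(1), matcher.group(2), matcher.group(3)));
    }
}
```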

Storage

store in repository or artifact storage
ensure consistent distribution across environments

Runtime usage

Loading

load once during application startup
reuse FrequencyTrie instance

Thread safety

compiled trie is safe for concurrent access
no synchronization required for reads

Avoid repeated loading

Do not:

load trie per request
rebuild trie at runtime
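
A small holder can enforce the load-once rule. The sketch below is generic on purpose: `LoadOnce` is a hypothetical class, and the supplier passed to it is assumed to wrap the actual loadBinary(...) call, whose exact signature is not shown in this document. The supplier must not return null.

```java
import java.util.function.Supplier;

/** Hypothetical holder that runs an expensive loader at most once. */
final class LoadOnce<T> implements Supplier<T> {

    private final Supplier<T> loader;
    private volatile T value;

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, loading it on first use (double-checked locking). */
    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    result = loader.get();
                    value = result;
                }
            }
        }
        return result;
    }
}
```

Calling get() during application startup moves the loading cost out of the first request and surfaces loading failures early.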

Memory considerations

compiled tries are compact but not negligible
size depends on dictionary size and reduction mode

Recommendations:

monitor memory usage in production
choose reduction mode appropriately

Reduction mode in production

Default recommendation:

use ranked mode

Switch to other modes only when:

memory constraints are strict
multiple candidate results are not required

Always validate behavior after changing reduction mode.

Dictionary lifecycle

Updating dictionaries

When dictionary data changes:

update source file
recompile
run validation tests
deploy new artifact

Backward compatibility

changes in dictionary may affect stemming results
evaluate impact on search relevance

Observability

Radixor itself does not provide observability features; the integrating application should provide:

logging for loading failures
metrics for lookup throughput
monitoring of memory usage

Optional:

sampling of ambiguous results (getAll())
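
Lookup metrics can be collected by wrapping the lookup in the integrating application. In the sketch below, `CountingLookup` is a hypothetical wrapper around a plain function; treating a null result as a miss is an assumption about how the wrapped lookup reports unknown words.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Hypothetical wrapper that counts lookups and misses for metrics export. */
final class CountingLookup implements Function<String, String> {

    private final Function<String, String> delegate;
    private final LongAdder lookups = new LongAdder();
    private final LongAdder misses = new LongAdder();

    CountingLookup(Function<String, String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String apply(String word) {
        lookups.increment();
        String result = delegate.apply(word);
        if (result == null) {
            // Assumption: the wrapped lookup signals "no entry" with null.
            misses.increment();
        }
        return result;
    }

    long lookupCount() {
        return lookups.sum();
    }

    long missCount() {
        return misses.sum();
    }
}
```

LongAdder keeps the counters cheap under concurrent access, which matches the read-only, multi-threaded use of a compiled trie.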

Error handling

During compilation

Handle:

invalid dictionary format
I/O failures
invalid arguments
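
Some of these failures can be caught before compilation starts. The sketch below shows a pre-flight check on the dictionary path; `CompilePreflight` is a hypothetical helper and does not reflect how the Compile CLI validates its input. I/O failures can still occur later, so the compile step itself needs its own error handling.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical pre-flight check run before invoking compilation. */
final class CompilePreflight {

    private CompilePreflight() {
    }

    /** Returns a problem description, or empty when the input looks usable. */
    static Optional<String> check(Path dictionary) {
        if (dictionary == null) {
            return Optional.of("no dictionary path given");
        }
        if (!Files.isRegularFile(dictionary)) {
            return Optional.of("not a regular file: " + dictionary);
        }
        if (!Files.isReadable(dictionary)) {
            return Optional.of("not readable: " + dictionary);
        }
        return Optional.empty();
    }
}
```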

During runtime