Radixor/docs/quality-and-operations.md
Leo Galambos 038514bad0 Refine stemmer core, compiled trie workflow, tests, and public documentation
2026-04-13 02:10:46 +02:00

Quality and Operations

← Back to README.md

This document describes quality, testing, and operational practices for Radixor.

It focuses on:

  • reliability and determinism
  • testing strategies
  • deployment patterns
  • performance considerations
  • lifecycle management of stemmer data

Overview

Radixor is designed to separate:

  • data preparation (dictionary construction and compilation)
  • runtime execution (lookup and patch application)

This separation enables:

  • predictable runtime behavior
  • reproducible builds
  • controlled evolution of stemming data

Determinism and reproducibility

Radixor emphasizes deterministic behavior.

Deterministic outputs

Given:

  • the same dictionary input
  • the same reduction settings

Radixor guarantees:

  • identical compiled trie structure
  • identical value ordering
  • identical lookup results
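
One way to check this guarantee in a build pipeline is to compile the same dictionary twice with the same settings and compare the two artifacts byte for byte. The sketch below uses only the JDK; the class and method names are hypothetical and not part of Radixor. It assumes the compiler writes no timestamps or other varying metadata into the artifact. If it does, compare the decompressed payloads instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical build-pipeline helper; not part of Radixor. */
final class ReproducibilityCheck {

    private ReproducibilityCheck() {
    }

    /**
     * Returns true when the two files are byte-for-byte identical.
     * Files.mismatch (Java 12+) returns -1 when no differing byte is found.
     */
    static boolean identical(Path firstBuild, Path secondBuild) throws IOException {
        return Files.mismatch(firstBuild, secondBuild) == -1L;
    }
}
```

A failing comparison is a signal to investigate before release, since it means either the input or the build environment changed.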

Why this matters

  • stable search behavior across deployments
  • reproducible builds
  • easier debugging and regression analysis

Testing strategy

Unit testing

Core components should be tested independently:

  • patch encoding and decoding
  • trie construction
  • reduction behavior
  • binary serialization and deserialization

Dictionary validation tests

A recommended pattern:

  1. load dictionary input
  2. compile trie
  3. re-apply all word → stem mappings
  4. verify that:
    • the expected stem is present in getAll()
    • the preferred result (get()) matches the expected stem wherever that result is deterministic

This ensures:

  • no data loss during reduction
  • correctness of patch encoding
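
The pattern above can be sketched as a small helper that a JUnit test calls. This is a sketch under assumptions: the signatures of get() and getAll() are not shown in this document, so the helper takes a plain function that stands in for getAll() and is expected to return the candidate stems for a word, with any patch already applied. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical validation helper; the function argument stands in for getAll(). */
final class DictionaryValidation {

    private DictionaryValidation() {
    }

    /**
     * Checks every word -> stem pair against the candidate stems of the compiled trie.
     *
     * @param expected word -> expected stem, as read from the source dictionary
     * @param allStems adapter returning every candidate stem for a word
     * @return the words whose expected stem was not among the candidates
     */
    static List<String> wordsMissingExpectedStem(Map<String, String> expected,
            Function<String, List<String>> allStems) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, String> entry : expected.entrySet()) {
            List<String> candidates = allStems.apply(entry.getKey());
            if (candidates == null || !candidates.contains(entry.getValue())) {
                failures.add(entry.getKey());
            }
        }
        return failures;
    }
}
```

A test would assert that the returned list is empty and print its contents on failure, so a lost mapping is traceable to a specific word.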

Regression testing

Maintain a stable test dataset:

  • representative vocabulary
  • edge cases (short words, long words, ambiguous forms)

Use it to:

  • detect unintended changes
  • verify behavior after refactoring
  • validate reduction mode changes
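
A regression check of this kind can be as simple as comparing current results against a stored baseline. The sketch below assumes the baseline is available as a word-to-stem map and that the current stemmer is wrapped in a function returning the preferred stem; both are assumptions for illustration, and all names are hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical regression helper; not part of Radixor. */
final class RegressionDiff {

    private RegressionDiff() {
    }

    /**
     * Returns one entry per word whose stem differs from the baseline.
     * The value describes the change as "old -> new"; entries are sorted by word.
     */
    static Map<String, String> changedStems(Map<String, String> baseline,
            Function<String, String> currentStem) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> entry : baseline.entrySet()) {
            String now = currentStem.apply(entry.getKey());
            if (!Objects.equals(entry.getValue(), now)) {
                changes.put(entry.getKey(), entry.getValue() + " -> " + now);
            }
        }
        return changes;
    }
}
```

An empty result means no drift. A non-empty result is not necessarily a defect, but each change should be reviewed and the baseline updated deliberately.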

Performance testing

Performance should be evaluated in terms of:

Throughput

  • words processed per second

Latency

  • time per lookup

Memory footprint

  • size of compiled trie
  • runtime memory usage

Benchmark with:

  • realistic token streams
  • production-like dictionaries
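
For a first estimate of throughput and latency, a plain timing loop is often enough. The sketch below is a rough harness, not a rigorous benchmark: it does a single warm-up pass and ignores GC and JIT effects, so a dedicated tool such as JMH is preferable for numbers that drive decisions. The stemming function is passed in as a parameter because the exact lookup API is not shown here; all names are hypothetical.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical micro-harness; not part of Radixor. */
final class LookupBenchmark {

    private LookupBenchmark() {
    }

    /** Returns the mean time per lookup in nanoseconds over the timed rounds. */
    static double meanNanosPerLookup(List<String> tokens, Function<String, String> stem, int rounds) {
        if (tokens.isEmpty() || rounds <= 0) {
            throw new IllegalArgumentException("need at least one token and one round");
        }
        long sink = run(tokens, stem); // warm-up pass, not timed
        long start = System.nanoTime();
        for (int round = 0; round < rounds; round++) {
            sink += run(tokens, stem);
        }
        long elapsed = Math.max(1L, System.nanoTime() - start);
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // consume the sink so the loop is not optimized away
        }
        return (double) elapsed / ((double) tokens.size() * rounds);
    }

    /** Converts mean latency into throughput (words per second). */
    static double wordsPerSecond(double meanNanosPerLookup) {
        return 1_000_000_000.0 / meanNanosPerLookup;
    }

    private static long run(List<String> tokens, Function<String, String> stem) {
        long sink = 0;
        for (String token : tokens) {
            String result = stem.apply(token);
            if (result != null) {
                sink += result.length();
            }
        }
        return sink;
    }
}
```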

Deployment model

  1. prepare dictionary data
  2. compile using CLI
  3. store .radixor.gz artifact
  4. deploy artifact with application
  5. load using loadBinary(...)

Why this model

  • avoids runtime compilation overhead
  • reduces startup latency
  • ensures consistent behavior across environments

Artifact management

Compiled stemmers should be treated as versioned assets.

Versioning

  • include version in filename or metadata
  • track dictionary source and reduction settings

Example:

english-v1.2-ranked.radixor.gz
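
If the filename carries the version, tooling can read it back without opening the artifact. The sketch below assumes the example name follows a language-version-mode layout; that layout is an interpretation of the example above, not a documented Radixor convention, and the class is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for names such as english-v1.2-ranked.radixor.gz. */
final class ArtifactName {

    // Assumed layout: <language>-v<version>-<reduction mode>.radixor.gz
    private static final Pattern NAME =
            Pattern.compile("([a-z]+)-v(\\d+(?:\\.\\d+)*)-([a-z]+)\\.radixor\\.gz");

    final String language;
    final String version;
    final String reductionMode;

    private ArtifactName(String language, String version, String reductionMode) {
        this.language = language;
        this.version = version;
        this.reductionMode = reductionMode;
    }

    /** Returns the parsed name, or an empty Optional when the name does not match the layout. */
    static Optional<ArtifactName> parse(String fileName) {
        Matcher matcher = NAME.matcher(fileName);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new ArtifactName(matcher.group(1), matcher.group(2), matcher.group(3)));
    }
}
```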

Storage

  • store in repository or artifact storage
  • ensure consistent distribution across environments

Runtime usage

Loading

  • load once during application startup
  • reuse FrequencyTrie instance

Thread safety

  • compiled trie is safe for concurrent access
  • no synchronization required for reads

Avoid repeated loading

Do not:

  • load trie per request
  • rebuild trie at runtime
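
A minimal way to enforce load-once-and-reuse is to load the trie eagerly in a holder created at startup and hand that holder to all request handlers. The sketch below is generic because the loading call and its return type are not shown in this document; the loader is passed in as a supplier, and the class name is hypothetical.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical holder that loads a shared instance exactly once, at construction time. */
final class SharedStemmer<T> {

    // A final field is safely published to other threads once the constructor completes.
    private final T instance;

    SharedStemmer(Supplier<? extends T> loader) {
        this.instance = Objects.requireNonNull(loader.get(), "loader returned null");
    }

    /** Returns the shared instance; never triggers another load. */
    T get() {
        return instance;
    }
}
```

Because loading happens in the constructor, a broken artifact surfaces during startup rather than on the first request.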

Memory considerations

  • compiled tries are compact, but their memory footprint is not negligible
  • size depends on:
    • dictionary size
    • reduction mode

Recommendations:

  • monitor memory usage in production
  • choose reduction mode appropriately

Reduction mode in production

Default recommendation:

  • use ranked mode

Switch to other modes only when:

  • memory constraints are strict
  • multiple candidate results are not required

Always validate behavior after changing reduction mode.

Dictionary lifecycle

Updating dictionaries

When dictionary data changes:

  1. update source file
  2. recompile
  3. run validation tests
  4. deploy new artifact

Backward compatibility

  • changes in dictionary may affect stemming results
  • evaluate impact on search relevance

Observability

Radixor itself does not provide observability features; the integrating application should provide:

  • logging for loading failures
  • metrics for lookup throughput
  • monitoring of memory usage

Optional:

  • sampling of ambiguous results (getAll())
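
Since these metrics live in the integrating application, a thin wrapper around the lookup is usually enough. The sketch below counts lookups and ambiguous results with JDK counters; the function argument stands in for getAll(), whose real signature is not shown in this document, and the class name is hypothetical. Exporting the counters to a metrics system is left to the application.

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Hypothetical metering wrapper; the delegate stands in for getAll(). */
final class MeteredLookup {

    private final Function<String, List<String>> delegate;
    private final LongAdder lookups = new LongAdder();
    private final LongAdder ambiguous = new LongAdder();

    MeteredLookup(Function<String, List<String>> delegate) {
        this.delegate = delegate;
    }

    /** Delegates the lookup and records whether it returned more than one candidate. */
    List<String> lookupAll(String word) {
        List<String> candidates = delegate.apply(word);
        lookups.increment();
        if (candidates != null && candidates.size() > 1) {
            ambiguous.increment();
        }
        return candidates;
    }

    long lookupCount() {
        return lookups.sum();
    }

    long ambiguousCount() {
        return ambiguous.sum();
    }
}
```

LongAdder keeps contention low when many threads record lookups at once, which suits a shared read-only trie.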

Error handling

During compilation

Handle:

  • invalid dictionary format
  • I/O failures
  • invalid arguments

During runtime

Handle:

  • missing dictionary files
  • corrupted binary artifacts

Fail fast on initialization errors.
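
A fail-fast startup check might look like the sketch below. It covers both runtime cases listed above: a missing file, and a file that is not valid GZip data or fails decompression. It stops before the actual Radixor loading call, because the signature of loadBinary(...) is not shown in this document; the class name and messages are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

/** Hypothetical startup helper; not part of Radixor. */
final class ArtifactOpener {

    private ArtifactOpener() {
    }

    /**
     * Reads and decompresses the artifact, or throws IllegalStateException
     * with a message that names the file and the kind of failure.
     */
    static byte[] readOrFail(Path artifact) {
        if (!Files.isRegularFile(artifact)) {
            throw new IllegalStateException("Stemmer artifact not found: " + artifact);
        }
        try (InputStream raw = Files.newInputStream(artifact);
                InputStream in = new GZIPInputStream(raw)) {
            // Reading to the end also triggers the GZip trailer (CRC) check.
            return in.readAllBytes();
        } catch (IOException e) {
            throw new IllegalStateException("Stemmer artifact unreadable or corrupted: " + artifact, e);
        }
    }
}
```

Letting the exception propagate out of application startup keeps a misconfigured deployment from serving requests with no stemmer loaded.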

Operational best practices

  • compile dictionaries offline
  • version compiled artifacts
  • test before deployment
  • load once and reuse
  • monitor performance and memory
  • document reduction settings used

Security considerations

  • treat dictionary input as trusted data
  • validate external sources before compilation
  • avoid loading unverified binary artifacts
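
One common way to avoid loading unverified artifacts is to pin a SHA-256 digest at build time and check it before loading. The sketch below uses only the JDK. How the expected digest is stored and distributed (manifest, configuration, release notes) is left open, and the class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical integrity check; not part of Radixor. */
final class ArtifactVerifier {

    private ArtifactVerifier() {
    }

    /** Returns the SHA-256 digest of the data as lowercase hexadecimal. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    /** Throws SecurityException when the file does not match the pinned digest. */
    static void requireDigest(Path artifact, String expectedHex) throws IOException {
        String actual = sha256Hex(Files.readAllBytes(artifact));
        if (!actual.equalsIgnoreCase(expectedHex)) {
            throw new SecurityException("Digest mismatch for " + artifact);
        }
    }
}
```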

Integration checklist

Before production deployment:

  • dictionary validated
  • compiled artifact generated
  • reduction mode documented
  • performance tested
  • memory usage verified
  • regression tests passing

Summary

Radixor is designed for:

  • deterministic behavior
  • efficient runtime execution
  • controlled data-driven evolution

Because compilation is separated from runtime, Radixor can be integrated reliably into production systems, provided the operational practices described above are followed.