Files
Radixor/docs/quality-and-operations.md
Leo Galambos 038514bad0 Refine stemmer core, compiled trie workflow, tests, and public documentation
feat: implement Compile CLI for building binary stemmer tables from source dictionaries
feat: add loading support for persisted compiled tries, including GZip-compressed binaries
feat: add a builder path for recreating a writable trie from a compiled trie
feat: expose read-only value/count access for compiled trie entries
feat: support deterministic NOOP patch encoding for identical source and target words

fix: make value selection deterministic for equal frequencies using length and lexical tie-breakers
fix: preserve valid alternative reductions during trie optimization and reduction
fix: correct patch command edge cases discovered in round-trip and malformed-input tests
fix: address persistence and compiled-trie handling defects found during implementation review
fix: resolve test failures and behavioral regressions uncovered by PMD and JUnit runs

refactor: reorganize trie-related support types into dedicated packages and classes
refactor: simplify the core FrequencyTrie design toward a cleaner practical architecture
refactor: improve compiled/read-only trie boundaries without restoring mutability
refactor: clean up internal reduction, serialization, and helper structure

test: add professional JUnit coverage for stemmer core classes
test: split trie tests into dedicated test classes per production type
test: improve parameterized tests for readability, diagnostics, and edge-case traceability
test: cover positive, negative, malformed, persistence, and round-trip scenarios
test: verify compiled dictionaries against source inputs using getAll semantics

docs: write public README and supplementary Markdown documentation for project publishing
docs: document architecture, reduction model, built-in languages, and operational guidance
docs: clarify reverse-word storage, mutable construction, and compiled-trie runtime behavior
docs: remove placeholders, vague buzzwords, and unexplained terminology from the documentation
docs: improve examples and wording for professional reader-facing project guidance

chore: align project materials with the practical Radix scope and Egothor/Stempel lineage
chore: raise overall project quality through documentation review and test hardening
2026-04-13 02:10:46 +02:00

318 lines
5.2 KiB
Markdown

# Quality and Operations
> ← Back to [README.md](../README.md)
This document describes quality, testing, and operational practices for **Radixor**.
It focuses on:
- reliability and determinism
- testing strategies
- deployment patterns
- performance considerations
- lifecycle management of stemmer data
## Overview
Radixor is designed to separate:
- **data preparation** (dictionary construction and compilation)
- **runtime execution** (lookup and patch application)
This separation enables:
- predictable runtime behavior
- reproducible builds
- controlled evolution of stemming data
## Determinism and reproducibility
Radixor emphasizes deterministic behavior.
### Deterministic outputs
Given:
- the same dictionary input
- the same reduction settings
Radixor guarantees:
- identical compiled trie structure
- identical value ordering
- identical lookup results
### Why this matters
- stable search behavior across deployments
- reproducible builds
- easier debugging and regression analysis
## Testing strategy
### Unit testing
Core components should be tested independently:
- patch encoding and decoding
- trie construction
- reduction behavior
- binary serialization and deserialization
### Dictionary validation tests
A recommended pattern:
1. load dictionary input
2. compile trie
3. re-apply all word → stem mappings
4. verify that:
- expected stem is present in `getAll()`
- preferred result (`get()`) is correct when deterministic
This ensures:
- no data loss during reduction
- correctness of patch encoding
## Regression testing
Maintain a stable test dataset:
- representative vocabulary
- edge cases (short words, long words, ambiguous forms)
Use it to:
- detect unintended changes
- verify behavior after refactoring
- validate reduction mode changes
## Performance testing
Performance should be evaluated in terms of:
### Throughput
- words processed per second
### Latency
- time per lookup
### Memory footprint
- size of compiled trie
- runtime memory usage
Benchmark with:
- realistic token streams
- production-like dictionaries
## Deployment model
### Recommended workflow
1. prepare dictionary data
2. compile using CLI
3. store `.radixor.gz` artifact
4. deploy artifact with application
5. load using `loadBinary(...)`
### Why this model
- avoids runtime compilation overhead
- reduces startup latency
- ensures consistent behavior across environments
## Artifact management
Compiled stemmers should be treated as versioned assets.
### Versioning
- include version in filename or metadata
- track dictionary source and reduction settings
Example:
```
english-v1.2-ranked.radixor.gz
```
### Storage
- store in repository or artifact storage
- ensure consistent distribution across environments
## Runtime usage
### Loading
- load once during application startup
- reuse `FrequencyTrie` instance
### Thread safety
- compiled trie is safe for concurrent access
- no synchronization required for reads
### Avoid repeated loading
Do not:
- load trie per request
- rebuild trie at runtime
## Memory considerations
- compiled tries are compact but not negligible
- size depends on:
- dictionary size
- reduction mode
Recommendations:
- monitor memory usage in production
- choose reduction mode appropriately
## Reduction mode in production
Default recommendation:
- use **ranked mode**
Switch to other modes only when:
- memory constraints are strict
- multiple candidate results are not required
Always validate behavior after changing reduction mode.
## Dictionary lifecycle
### Updating dictionaries
When dictionary data changes:
1. update source file
2. recompile
3. run validation tests
4. deploy new artifact
### Backward compatibility
- changes in dictionary may affect stemming results
- evaluate impact on search relevance
## Observability
Radixor itself does not provide observability features; integration should provide:
- logging for loading failures
- metrics for lookup throughput
- monitoring of memory usage
Optional:
- sampling of ambiguous results (`getAll()`)
## Error handling
### During compilation
Handle:
- invalid dictionary format
- I/O failures
- invalid arguments
### During runtime
Handle:
- missing dictionary files
- corrupted binary artifacts
Fail fast on initialization errors.
## Operational best practices
- compile dictionaries offline
- version compiled artifacts
- test before deployment
- load once and reuse
- monitor performance and memory
- document reduction settings used
## Security considerations
- treat dictionary input as trusted data
- validate external sources before compilation
- avoid loading unverified binary artifacts
## Integration checklist
Before production deployment:
- dictionary validated
- compiled artifact generated
- reduction mode documented
- performance tested
- memory usage verified
- regression tests passing
## Next steps
- [Quick start](quick-start.md)
- [CLI compilation](cli-compilation.md)
- [Programmatic usage](programmatic-usage.md)
## Summary
Radixor is designed for:
- deterministic behavior
- efficient runtime execution
- controlled data-driven evolution
By separating compilation from runtime and following proper operational practices, it can be reliably integrated into production-grade systems.