docs: improve README, MkDocs content, branding assets, and site polish (2)

2026-04-19 00:20:24 +02:00
parent 0b674a39a8
commit 0dc516357f
3 changed files with 411 additions and 0 deletions

Binary file not shown.


@@ -0,0 +1,185 @@
# Compatibility and Guarantees
This document explains what Radixor treats as stable public behavior, what should be regarded as internal implementation detail, and how to think about compatibility across versions.
Its purpose is to make adoption safer. Users should be able to understand which parts of the project are intended as supported API, which parts may evolve more freely, and which kinds of change are expected to remain compatible in future releases.
## Compatibility philosophy
Radixor is designed to be used as a real library, not only as a code drop. That means compatibility matters.
At the same time, the project distinguishes clearly between:
- **public API and behavior** that users are expected to build against,
- **internal implementation layers** that may change more freely when needed for correctness, performance, or maintainability.
The practical goal is straightforward:
- keep the main user-facing API in `org.egothor.stemmer` stable and supportable,
- allow more freedom of evolution in internal trie-focused implementation layers,
- extend the project conservatively without creating unnecessary behavioral ambiguity.
## Public API posture
As a general rule, the `org.egothor.stemmer` package should be treated as the primary supported API surface.
That includes the main user-facing types involved in:
- dictionary loading,
- binary loading and persistence,
- patch-command application,
- compiled trie querying,
- reconstruction workflows,
- reduction configuration,
- CLI use.
This API is expected to remain supportable across future versions. The preferred compatibility model is additive evolution: improving documentation, clarifying behavior, and adding capabilities without unnecessary disruption of existing usage patterns.
Examples of likely additive evolution include:
- additional bundled language resources,
- fuller support for diacritics or native-script language resources,
- expanded documentation and operational tooling,
- new convenience methods that do not break existing code.
## Internal API posture
The `org.egothor.stemmer.trie` package should be treated as internal, or at least as significantly less stable, implementation API.
It represents the structural machinery behind mutable nodes, reduced nodes, compiled nodes, reduction context, signatures, and related internal compilation details. These types may evolve more aggressively when needed to improve implementation quality, correctness, reduction behavior, internal representations, or performance characteristics.
Users should therefore avoid building long-term integrations against `org.egothor.stemmer.trie` unless they are intentionally accepting that tighter coupling.
In practical terms:
- `org.egothor.stemmer` is the supported integration layer,
- `org.egothor.stemmer.trie` is the implementation layer.
## Behavioral guarantees
Several project properties are intended as core behavioral guarantees.
### Deterministic dictionary loading and compilation
Given the same textual dictionary input and the same reduction settings, Radixor is intended to produce the same compiled stemming semantics in a reproducible way.
This includes deterministic local result ordering and deterministic observable lookup behavior.
### Stable meaning of `get()` and `getAll()`
The distinction between preferred-result lookup and multi-result lookup is part of the supported behavior model.
- `get()` returns the locally preferred stored value,
- `getAll()` returns all locally stored values in deterministic ranked order,
- `getEntries()` returns aligned values with counts.
That model is part of how the public API should be understood.
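The relationship between the three lookups can be pictured with a small toy model. This is only a sketch of the documented semantics, not Radixor's implementation: the class `RankedValuesSketch`, its `record` method, and the ranking rule (higher count first, ties broken by value text) are assumptions made for illustration. The real ranking rule is whatever the public API documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one lookup location. Hypothetical; not a Radixor class. */
final class RankedValuesSketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Records one more occurrence of a stored value. */
    void record(String value) {
        counts.merge(value, 1, Integer::sum);
    }

    /** Values with counts, ranked by count descending, then by value text (assumed tie-break). */
    List<Map.Entry<String, Integer>> getEntries() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            entries.add(Map.entry(e.getKey(), e.getValue()));
        }
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    /** All values in the same order as getEntries(), without counts. */
    List<String> getAll() {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, Integer> e : getEntries()) {
            values.add(e.getKey());
        }
        return values;
    }

    /** The preferred value: the first element of getAll(), or null when nothing is stored. */
    String get() {
        List<String> all = getAll();
        return all.isEmpty() ? null : all.get(0);
    }

    public static void main(String[] args) {
        RankedValuesSketch node = new RankedValuesSketch();
        node.record("B");
        node.record("A");
        node.record("B");
        System.out.println(node.get() + " " + node.getAll() + " " + node.getEntries());
    }
}
```

The point of the model is the invariant, not the data structure: `get()` always agrees with the first element of `getAll()`, and `getEntries()` lists the same values in the same order with their counts.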
### Stable reduction-mode intent
Each public `ReductionMode` constant carries a semantic contract that should remain meaningful across versions.
In other words, the implementation may evolve, but the intended meaning of modes such as ranked `getAll()` equivalence, unordered `getAll()` equivalence, and dominant `get()` equivalence should not drift casually.
### Stable binary artifact purpose
Compiled `.radixor.gz` artifacts are a first-class project output. Loading and persisting compiled stemmer artifacts is part of the intended usage model, not an incidental implementation side effect.
## What is allowed to evolve
Compatibility does not mean the project is frozen.
The following kinds of change are generally compatible with the project's direction:
- improved internal data structures,
- changes inside `org.egothor.stemmer.trie`,
- expanded bundled dictionaries,
- additional supported languages,
- improved native-script handling,
- better benchmarks, tests, and reports,
- additive public API growth that does not invalidate existing usage.
The project should be able to improve substantially while keeping the main user-facing integration model intact.
## What may change more cautiously
Some areas should be treated as stable in intent but still approached carefully when changed.
### Bundled dictionary contents
Bundled resources are versioned project data, not immutable language standards. Their contents may improve over time.
That means stemming outcomes can legitimately change when bundled dictionaries are refined or expanded. Such changes are compatible with the project's direction, but they should still be understood as behavior changes at the lexical-resource level.
### Binary format evolution
Compiled binary artifacts are an intended project output, but binary-format evolution may still be needed in future versions.
If the format changes, that should be handled deliberately and documented clearly. Users should not assume that every historical persisted artifact will remain readable forever without versioning considerations. What should remain stable is the project's support for compiled-artifact workflows. Perpetual cross-version binary interchange is not promised unless explicit format-evolution rules say so.
### Performance characteristics
Radixor places strong emphasis on performance, but no benchmark number should be treated as a formal compatibility guarantee.
What is more meaningful than any single raw number is the architectural performance posture: the library is intended to remain a compact compiled stemmer with very strong runtime throughput characteristics.
## What users should rely on
Long-term users should rely primarily on the following:
- the main integration path in `org.egothor.stemmer`,
- the documented meaning of `get()`, `getAll()`, and reduction modes,
- the offline-compilation plus runtime-loading workflow,
- the availability of compiled artifact support,
- the project's preference for deterministic and auditable behavior.
These are the parts of the project that are intended to remain the most stable and supportable.
## What users should not rely on casually
Users should avoid depending on:
- internal trie package details,
- undocumented internal classes or intermediate representations,
- incidental internal ordering outside documented lookup semantics,
- assumptions that bundled dictionary contents will never evolve,
- assumptions that internal binary-format details are frozen forever.
If a behavior is important to your integration, it should ideally be documented at the public API or project-documentation level rather than inferred from internal implementation details.
## Source compatibility and behavioral compatibility
It is useful to distinguish two different notions of compatibility.
### Source compatibility
Whether existing Java code using the supported public API still compiles and integrates cleanly after an upgrade.
### Behavioral compatibility
Whether the upgraded system still behaves the same way for the same dictionary data, compiled artifacts, and runtime calls.
Radixor aims to preserve both where reasonably possible, but behavioral compatibility can still be influenced by intentional improvements such as dictionary refinement or bug fixes. For that reason, upgrades should be evaluated not only as code upgrades but also as stemming-behavior upgrades.
## Recommended upgrade discipline
When upgrading Radixor in a production environment, it is good practice to:
1. review release notes and documentation changes,
2. rebuild compiled artifacts if the upgrade affects dictionary or artifact handling,
3. rerun representative stemming validation tests,
4. compare benchmark outputs where performance matters,
5. inspect whether bundled-dictionary changes affect expected canonical results.
This is especially important for deployments that treat stemming behavior as part of search relevance or normalization policy.
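Steps 3 and 5 can be approximated with a small regression check. The sketch below deliberately avoids any Radixor type: the stemmer is passed in as a plain `Function<String, String>`, and how that function is wired to the library is left to the integration. The class name `StemRegressionSketch` and the sample words are hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.function.Function;

/** Compares stems recorded before an upgrade with stems produced after it. */
final class StemRegressionSketch {

    /**
     * Returns word -> "expected -> actual" for every baseline word whose stem changed.
     * An empty result means the representative sample behaves as before.
     */
    static Map<String, String> changedStems(Map<String, String> baseline,
                                            Function<String, String> stemmer) {
        Map<String, String> changed = new TreeMap<>();
        for (Map.Entry<String, String> e : baseline.entrySet()) {
            String actual = stemmer.apply(e.getKey());
            if (!Objects.equals(e.getValue(), actual)) {
                changed.put(e.getKey(), e.getValue() + " -> " + actual);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Stand-in data: the "upgraded" stemmer is simulated with a second map.
        Map<String, String> baseline = Map.of("running", "run", "analyses", "analysis");
        Map<String, String> upgraded = Map.of("running", "run", "analyses", "analyze");
        System.out.println(changedStems(baseline, upgraded::get));
    }
}
```

A non-empty result is not automatically a defect. It may reflect an intentional dictionary refinement, which is why the differences are reported for review rather than treated as failures.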
## Summary
Radixor's compatibility model is intentionally layered.
- `org.egothor.stemmer` should be treated as the supported public integration API,
- `org.egothor.stemmer.trie` should be treated as an internal implementation layer,
- deterministic public behavior and compiled-artifact workflows are core project commitments,
- internal structure and lexical-resource quality can continue to evolve.
This model gives the project room to improve while still providing a reliable surface for long-term use.


@@ -0,0 +1,226 @@
# Contributing Dictionaries
High-quality dictionaries are one of the most valuable ways to improve **Radixor**.
The project already includes practical bundled dictionaries for common use, but the long-term quality and language reach of the stemmer depend heavily on the quality of its lexical resources. Contributions are therefore welcome not only in the form of code changes, but also in the form of well-prepared dictionary data for existing or additional languages.
This document explains what makes a dictionary contribution useful, how to structure it, and how to prepare it so that it integrates cleanly with the project.
## What a good dictionary contribution looks like
A good dictionary contribution is not defined only by the number of entries.
The most useful contributions are dictionaries that are:
- linguistically consistent,
- operationally clean,
- easy to review,
- easy to reproduce,
- appropriate for actual stemming use rather than raw lexical accumulation.
In practice, dictionary quality matters more than dictionary size. A smaller but coherent and carefully normalized dictionary is often more valuable than a larger resource that mixes conventions, contains noisy forms, or introduces accidental ambiguity.
## Preferred dictionary shape
Radixor uses a simple line-oriented format:
```text
<stem> <variant1> <variant2> <variant3> ...
```
The first token on a line is the canonical stem. All following tokens on that line are known variants that should reduce to that stem.
Example:
```text
run running runs ran
connect connected connecting connection
```
The parser:
- reads UTF-8 text,
- normalizes input to lower case using `Locale.ROOT`,
- ignores empty lines,
- supports remarks introduced by `#` or `//`.
For full format details, see [Dictionary format](dictionary-format.md).
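To make the parsing rules concrete, here is a minimal sketch of how one line could be turned into tokens. It is not Radixor's parser. The class `DictionaryLineSketch` is hypothetical, and two details are assumptions: that tokens are separated by whitespace, and that a remark may follow tokens on the same line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative line handling only; the real parser may differ in details. */
final class DictionaryLineSketch {

    /** Returns the lower-cased tokens of one line, or an empty list for blank or remark-only lines. */
    static List<String> parseLine(String line) {
        String text = stripRemark(line).trim().toLowerCase(Locale.ROOT);
        if (text.isEmpty()) {
            return List.of();
        }
        // First token is the canonical stem, the rest are its variants.
        return Arrays.asList(text.split("\\s+"));
    }

    /** Cuts the line at the first "#" or "//" marker, whichever comes first (assumed behavior). */
    static String stripRemark(String line) {
        int cut = line.length();
        int hash = line.indexOf('#');
        int slashes = line.indexOf("//");
        if (hash >= 0) {
            cut = Math.min(cut, hash);
        }
        if (slashes >= 0) {
            cut = Math.min(cut, slashes);
        }
        return line.substring(0, cut);
    }

    public static void main(String[] args) {
        List<String> tokens = parseLine("Run Running RUNS ran  # irregular past included");
        System.out.println("stem=" + tokens.get(0) + " variants=" + tokens.subList(1, tokens.size()));
    }
}
```

Lower-casing with `Locale.ROOT` keeps the result independent of the machine's default locale, which matters for reproducible dictionary builds.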
## Contribution priorities
The most useful dictionary contributions generally fall into one of four categories.
### 1. Stronger dictionaries for already bundled languages
Improving lexical quality for already supported languages is often more valuable than merely expanding the language list. Better coverage, cleaner canonicalization, and improved consistency directly improve practical stemming outcomes.
### 2. Additional languages
New language support is welcome when the submitted resource is strong enough to be useful as a maintainable bundled baseline rather than as an incomplete demonstration artifact.
### 3. Native-script language resources
The current bundled resources follow a pragmatic normalization convention and may use transliterated or otherwise normalized forms. This is especially visible for languages such as Russian.
That convention belongs to the supplied dictionaries, not to the underlying algorithm. The parser, trie, and patch-command model are not fundamentally restricted to plain ASCII. Contributions of high-quality native-script dictionaries in full UTF-8 text are therefore particularly valuable, because they would enable more direct language support without transliteration-based workflows.
### 4. Domain-quality refinements
Some contributions may be more appropriate as curated domain extensions than as replacements for a general-purpose bundled dictionary. These are still useful when they are clearly scoped and operationally coherent.
## Normalization guidance
A dictionary should follow one normalization convention consistently.
For current general-purpose bundled resources, the safest convention remains normalized plain-ASCII lexical input where that is already the established project style. For languages where a stronger native-script resource exists, a coherent UTF-8 dictionary may be preferable, provided that the contribution is deliberate, well-structured, and consistently normalized.
The important point is not to mix incompatible conventions casually.
Avoid contributions that combine, without clear design intent:
- native-script and transliterated forms,
- multiple incompatible stem conventions,
- inconsistent use of diacritics,
- ad hoc spelling normalization,
- noisy typo-like forms presented as ordinary lexical variants.
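One of these problems, mixing native-script and transliterated forms, can be spotted mechanically. The sketch below counts letters per Unicode script using the standard `Character.UnicodeScript` API. The class `ScriptTallySketch` is hypothetical, and the input is assumed to be dictionary lines with remarks already removed. It does not detect inconsistent diacritics, because accented Latin letters still belong to the Latin script.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts letters per Unicode script across dictionary lines. */
final class ScriptTallySketch {

    /** Returns script -> number of letters found in that script. */
    static Map<Character.UnicodeScript, Integer> letterScripts(List<String> lines) {
        Map<Character.UnicodeScript, Integer> tally = new TreeMap<>();
        for (String line : lines) {
            line.codePoints()
                .filter(Character::isLetter)
                .forEach(cp -> tally.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        }
        return tally;
    }

    public static void main(String[] args) {
        // More than one script in the tally suggests mixed conventions worth reviewing.
        System.out.println(letterScripts(List.of("run running", "бег бегать")));
    }
}
```

A dictionary that follows one convention should normally produce a single-script tally. Several scripts are not necessarily wrong, but they are a prompt to explain the design intent in the contribution note.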
## Choosing canonical stems
A dictionary line should reflect a stable canonical target form.
That means:
- choose one canonical representation and use it consistently,
- avoid mixing alternative stem conventions without a clear lexical reason,
- keep variants grouped under the form that the project should actually return as the canonical result.
For example, the following is coherent:
```text
analyze analyzing analyzed analyzes
```
The following is less useful if the project has not intentionally chosen mixed conventions:
```text
analyse analyzing analyzed analyzes
```
The contribution should make the intended canonical policy easy to understand.
## Ambiguity handling
Ambiguity is allowed, but it should be intentional.
If the same surface form appears under multiple stems, the compiled trie may later expose multiple candidate patch commands. This can be correct and desirable when the lexical reality genuinely requires it. However, accidental ambiguity caused by inconsistent source preparation makes the resource harder to trust and harder to review.
Before contributing a dictionary, check whether repeated surface forms across lines are:
- linguistically intentional,
- consistent with the chosen canonical policy,
- useful for runtime stemming behavior.
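Repeated surface forms can be listed before review with a short script. The following is a sketch under simple assumptions: whitespace-separated tokens, remarks only on their own lines, and only variants (not the stem token itself) counted as surface forms. `AmbiguityReportSketch` is a hypothetical helper, not part of Radixor.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Finds variants that are listed under more than one stem. */
final class AmbiguityReportSketch {

    /** Returns variant -> stems, keeping only variants that appear under two or more stems. */
    static Map<String, Set<String>> ambiguousForms(List<String> lines) {
        Map<String, Set<String>> stemsByForm = new TreeMap<>();
        for (String line : lines) {
            String text = line.trim().toLowerCase(Locale.ROOT);
            if (text.isEmpty() || text.startsWith("#") || text.startsWith("//")) {
                continue; // skip blank lines and whole-line remarks
            }
            String[] tokens = text.split("\\s+");
            String stem = tokens[0];
            for (int i = 1; i < tokens.length; i++) {
                stemsByForm.computeIfAbsent(tokens[i], k -> new TreeSet<>()).add(stem);
            }
        }
        // Drop every variant that maps to a single stem.
        stemsByForm.values().removeIf(stems -> stems.size() < 2);
        return stemsByForm;
    }

    public static void main(String[] args) {
        System.out.println(ambiguousForms(List.of("leaf leaves", "leave leaves leaving", "run running")));
    }
}
```

Each reported form then gets a human decision: keep it because the ambiguity is linguistically real, or fix the source because it came from inconsistent preparation.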
## What to avoid
Dictionary contributions are much easier to review and accept when they avoid common quality problems.
Avoid:
- mechanically aggregated word lists without review,
- inconsistent canonical forms,
- mixed orthographic conventions without explanation,
- accidental duplicates caused by source merging,
- noisy or non-lexical tokens,
- comments or formatting that make the source hard to audit.
A dictionary should read like a curated lexical resource, not like an unfiltered export.
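Accidental duplicates from source merging are easy to surface automatically. The sketch below flags two patterns for review: a stem that starts more than one line, and a token repeated within a single line. Whether either pattern is actually an error depends on the dictionary's intent, so the output is a review list, not a verdict. `DuplicateCheckSketch` is a hypothetical helper that assumes whitespace-separated tokens and whole-line remarks.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Flags likely merge artifacts in a dictionary source. */
final class DuplicateCheckSketch {

    /** Stems that appear as the first token on more than one line. */
    static Set<String> repeatedStems(List<String> lines) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new TreeSet<>();
        for (String line : lines) {
            String text = line.trim().toLowerCase(Locale.ROOT);
            if (text.isEmpty() || text.startsWith("#") || text.startsWith("//")) {
                continue; // skip blank lines and whole-line remarks
            }
            String stem = text.split("\\s+")[0];
            if (!seen.add(stem)) {
                repeated.add(stem);
            }
        }
        return repeated;
    }

    /** Tokens that occur more than once within one line, the stem included. */
    static Set<String> repeatedTokensInLine(String line) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new TreeSet<>();
        for (String token : line.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !seen.add(token)) {
                repeated.add(token);
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        System.out.println(repeatedStems(List.of("run running", "connect connected", "run ran")));
        System.out.println(repeatedTokensInLine("connect connected connecting connected"));
    }
}
```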
## Practical preparation workflow
A disciplined dictionary contribution should typically follow this path:
1. prepare or normalize the lexical source,
2. convert it into Radixor dictionary format,
3. review canonical stem choices,
4. check for accidental duplicates and unintended ambiguity,
5. compile the dictionary,
6. test representative lookups,
7. inspect `get()` and `getAll()` behavior for important edge cases,
8. include a concise explanation of source provenance and normalization choices.
## What to test before submitting
At minimum, a proposed dictionary should be checked for:
- successful parsing,
- successful compilation,
- expected stemming behavior on representative examples,
- acceptable ambiguity behavior,
- stable canonical policy,
- absence of obvious malformed lines or accidental source contamination.
For important resources, it is also useful to test:
- whether representative forms survive reduction as expected,
- whether dominant-result behavior remains sensible if alternate reduction modes are used,
- whether the resulting artifact has a practical size for the intended use case.
## Contribution notes that help maintainers
A dictionary contribution becomes much easier to review when it includes a short maintainer-facing note describing:
- the language or domain covered,
- the provenance of the lexical data,
- the normalization convention used,
- whether the dictionary is ASCII-normalized or native-script UTF-8,
- the intended canonical stem policy,
- any known limitations,
- why the contribution improves the project in practical terms.
This note does not need to be long. It simply needs to make the resource intelligible.
## Bundled-resource expectations
Not every useful dictionary must automatically become a bundled language resource.
To be suitable for bundling, a dictionary should generally be:
- broadly useful,
- maintainable,
- legally safe to include,
- coherent enough to serve as a project baseline,
- strong enough that users can rely on it as more than a demonstration resource.
Some dictionaries are better treated as examples, experiments, or domain-specific artifacts rather than as general built-in resources.
## Native scripts and future language support
One of the most meaningful future directions for the project is stronger support for languages in their native writing systems.
The architecture does not need to change fundamentally for that to happen. What matters is the availability of strong lexical resources and the willingness to define clear conventions for how those resources should be bundled and maintained.
Contributions in this area are therefore especially valuable when they are:
- internally consistent,
- encoded as proper UTF-8 text,
- accompanied by a clear explanation of normalization assumptions,
- strong enough to support practical use rather than only demonstration.
## Related documentation
- [Built-in languages](built-in-languages.md)
- [Dictionary format](dictionary-format.md)
- [CLI compilation](cli-compilation.md)
- [Programmatic usage](programmatic-usage.md)
## Summary
The best dictionary contributions improve Radixor not merely by adding more entries, but by improving the linguistic quality, consistency, and practical usefulness of the lexical resources the project can compile and ship.
A strong contribution is therefore one that is:
- coherent,
- reviewable,
- operationally clean,
- well explained,
- and valuable for real stemming workloads.