Parity envelope
The values below are the release gate for every semvec build. They cover engine-internal parity (Rust core vs the pure-Python reference implementation) and API determinism. They are independent of the LOCOMO numbers in Benchmarks, which target end-to-end answer quality.
Structural parity (must hold)
| Property | Envelope |
|---|---|
| Phase-detector decision | Bit-identical on identical input across the parity test suite |
| Serializer output (short haystack, < 100 chunks) | Byte-identical |
| Consensus decision (all 5 levels) | 100 % agreement over 25 LLM-driven rounds |
| network_resonance parity | ≤ 1.1 × 10⁻¹⁶ (machine epsilon) |
| BM25-hybrid index → cosine-only fallback | Identical when SEMVEC_HYBRID_BM25=0 |
| /v1/run context block (same input) | Byte-identical within a single process |
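A check against this envelope might look like the sketch below. It is an illustration, not the project's test code: `byte_identical` and `within_envelope` are hypothetical helper names, and reading the 1.1 × 10⁻¹⁶ bound as an absolute, elementwise difference of 2⁻⁵³ is an assumption.

```python
import hashlib

# Assumption: the envelope is an absolute, elementwise bound of 2**-53 (about 1.1e-16).
ENVELOPE = 2.0 ** -53


def byte_identical(a: bytes, b: bytes) -> bool:
    """Compare two serialized outputs via their SHA-256 digests."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def within_envelope(reference, candidate, tol=ENVELOPE):
    """True if both sequences have equal length and differ by at most tol per element."""
    if len(reference) != len(candidate):
        return False
    return all(abs(r - c) <= tol for r, c in zip(reference, candidate))
```

Comparing digests rather than raw buffers keeps failure messages short when the serialized output is large; a direct `a == b` comparison is equivalent for the pass/fail decision.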
Per-turn numeric deltas (documented drift)
Per-turn metric values are deterministic within a release and trend-comparable across releases. Exact numeric tolerances are not published. For parity testing you can pin the retrieval projection matrix via the methods documented in the Core API.
LLM-call stochasticity (NOT parity)
When measuring end-to-end answer quality (LOCOMO and similar benchmarks), note that
gpt-4o via OpenRouter is non-deterministic even at temperature=0,
because OpenRouter routes between providers (OpenAI direct, Azure, …)
that produce slightly different outputs. Empirically:
- ~60 % of LOCOMO QA predictions are byte-identical across repeat runs
- ~40 % differ in punctuation, casing, or whether the model answers "no info" or guesses
- aggregate F1 drifts by at most ±0.5 pp
Plan for this when reading bench reports: aggregate drift within ±1 pp is expected and is not a regression.
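The comparison of two repeat runs can be sketched as follows. This is a minimal illustration under stated assumptions: `drift_report` is a hypothetical helper, predictions are plain strings in matching order, and the aggregate F1 values are passed in as percentages computed elsewhere.

```python
def drift_report(run_a, run_b, f1_a_pct, f1_b_pct, budget_pp=1.0):
    """Summarize drift between two repeat runs of the same benchmark.

    run_a, run_b: lists of prediction strings, aligned by question.
    f1_a_pct, f1_b_pct: aggregate F1 of each run, in percent.
    budget_pp: allowed aggregate drift in percentage points (1 pp, per the text above).
    """
    if not run_a or len(run_a) != len(run_b):
        raise ValueError("runs must be non-empty and aligned")
    # Byte-identical means equal after encoding, so casing and punctuation count.
    identical = sum(
        a.encode("utf-8") == b.encode("utf-8") for a, b in zip(run_a, run_b)
    )
    delta_pp = f1_b_pct - f1_a_pct
    return {
        "identical_share": identical / len(run_a),
        "f1_delta_pp": delta_pp,
        "within_budget": abs(delta_pp) <= budget_pp,
    }
```

A report where `within_budget` is true but `identical_share` is well below 1.0 matches the pattern described above and does not, by itself, indicate a regression.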
Test suite
The parity assertions live in tests/test_core_port.py,
tests/test_compaction_port.py, tests/test_cortex_port.py and
tests/test_audit*.py. They run under pytest without external
dependencies (no LLM, no embedder).