Parity envelope

The values below are the release gate for every semvec build. They cover engine-internal parity (the Rust core versus the pure-Python reference implementation) and API determinism. They are independent of the LOCOMO numbers on the Benchmarks page, which measure end-to-end answer quality.

Structural parity (must hold)

Property	Envelope
Phase-detector decision	Bit-identical on identical input across the parity test suite
Serializer output (short haystack, < 100 chunks)	Byte-identical
Consensus decision (all 5 levels)	100 % agreement over 25 LLM-driven rounds
`network_resonance` parity	≤ 1.1 × 10⁻¹⁶ (float64 unit roundoff, 2⁻⁵³; half of DBL_EPSILON)
BM25-hybrid index → cosine-only fallback	Identical when `SEMVEC_HYBRID_BM25=0`
`/v1/run` context block (same input)	Byte-identical within a single process
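The rows above use three kinds of comparison: exact bit patterns, exact bytes, and a floating-point bound. A minimal sketch of helpers for checking them follows. It assumes the `network_resonance` bound is an absolute, elementwise difference, which the envelope does not state explicitly. `max_abs_diff`, `within_envelope`, and `bit_identical` are hypothetical names, not part of the semvec API.

```python
import struct
from typing import Sequence

# Hypothetical helpers (not part of semvec). Unit roundoff for IEEE-754 float64:
# 2**-53 ≈ 1.11e-16, i.e. half of DBL_EPSILON (2**-52).
UNIT_ROUNDOFF = 2.0 ** -53


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} != {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def within_envelope(a: Sequence[float], b: Sequence[float],
                    tol: float = UNIT_ROUNDOFF) -> bool:
    """Assumption: the envelope is an absolute bound applied to every element."""
    return max_abs_diff(a, b) <= tol


def _pack(xs: Sequence[float]) -> bytes:
    # Little-endian float64 bit patterns; unlike ==, this distinguishes -0.0 from 0.0.
    return struct.pack(f"<{len(xs)}d", *xs)


def bit_identical(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only if every element has exactly the same IEEE-754 bit pattern."""
    return len(a) == len(b) and _pack(a) == _pack(b)
```

For example, `[0.5, 0.25, 0.125]` and `[0.5, 0.25 + 2**-54, 0.125]` are one ulp apart at 0.25. They pass `within_envelope` but fail `bit_identical`. For values in [1, 2), one ulp is already 2⁻⁵², so under this absolute reading the bound amounts to bit identity there. Byte-identical rows, such as serializer output, need nothing more than `==` on the two `bytes` objects.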

Per-turn numeric deltas (documented drift)

Per-turn metric values are deterministic within a release and comparable as trends across releases. Exact numeric tolerances are not published. For parity testing, you can pin the retrieval projection matrix using the methods documented in the Core API reference.
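Pinning matters because a projection re-drawn per process changes per-turn numbers even on identical input. The Core API methods are the supported way to pin it. The sketch below only illustrates the principle. It assumes the projection is a Gaussian random matrix drawn from a seeded PRNG, which semvec's actual construction may not be. `make_projection` and `project` are hypothetical names.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_projection(in_dim: int, out_dim: int, seed: int) -> Matrix:
    """Hypothetical: Gaussian random projection (out_dim x in_dim) from a seeded PRNG.

    Entries are N(0, 1/out_dim), a common Johnson-Lindenstrauss-style scaling.
    """
    rng = random.Random(seed)  # private PRNG instance, unaffected by global random.seed()
    scale = out_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix: Matrix, vec: List[float]) -> List[float]:
    """Matrix-vector product: maps an in_dim vector to out_dim."""
    if any(len(row) != len(vec) for row in matrix):
        raise ValueError("dimension mismatch")
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]
```

Two calls with the same seed yield the same matrix, so projecting the same vector gives exactly equal outputs. Different seeds give different matrices, and the downstream per-turn numbers shift with them.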

LLM-call stochasticity (NOT parity)

When measuring end-to-end answer quality (LOCOMO and similar benchmarks), note that gpt-4o via OpenRouter is non-deterministic even at temperature=0. OpenRouter routes requests between providers (OpenAI direct, Azure, …) whose outputs differ slightly. Empirically:

~60 % of LOCOMO QA predictions are byte-identical across repeated runs
~40 % differ in punctuation, in casing, or in whether the model answers "no info" or guesses
aggregate F1 stays within ±0.5 pp (percentage points)
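Scoring that normalizes answers before comparison would explain how ~40 % byte-level drift coexists with sub-point F1 drift. A minimal sketch follows. It assumes SQuAD-style token F1: lowercase, then strip punctuation and the articles a/an/the. The LOCOMO scorer's exact normalization may differ. `token_f1` and `compare_runs` are hypothetical helpers.

```python
import re
import string
from collections import Counter
from typing import List, Sequence, Tuple


def normalize(text: str) -> List[str]:
    """SQuAD-style: lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def compare_runs(run_a: Sequence[str], run_b: Sequence[str],
                 golds: Sequence[str]) -> Tuple[float, float]:
    """Return (fraction of byte-identical predictions, F1(b) - F1(a) in pp)."""
    if not (len(run_a) == len(run_b) == len(golds)) or not golds:
        raise ValueError("runs and gold answers must be non-empty and aligned")
    n = len(golds)
    identical = sum(a == b for a, b in zip(run_a, run_b)) / n
    f1_a = sum(token_f1(a, g) for a, g in zip(run_a, golds)) / n
    f1_b = sum(token_f1(b, g) for b, g in zip(run_b, golds)) / n
    return identical, 100.0 * (f1_b - f1_a)
```

Take gold answers `["Paris", "7 May 2023", "blue"]`. A run answering `"Paris."` and a repeat answering `"paris"` differ byte-wise but both score 1.0. Under this metric, only a flip such as `"blue"` to `"No info"` moves the aggregate.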

Plan for this drift when reading benchmark reports: a change in aggregate F1 within ±1 pp is expected run-to-run noise, not a regression.
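Applied mechanically, this rule is a three-way classification of the aggregate F1 delta. A minimal sketch follows, assuming F1 is reported on a 0 to 100 scale, so that 1 pp equals 1.0. `classify_f1_delta` is a hypothetical helper, not part of any semvec tooling.

```python
def classify_f1_delta(baseline_f1: float, candidate_f1: float,
                      tolerance_pp: float = 1.0) -> str:
    """Classify an aggregate F1 change (both on a 0-100 scale) against a noise band.

    Hypothetical helper: a change within ±tolerance_pp (inclusive) is treated as noise.
    """
    delta = candidate_f1 - baseline_f1
    if delta < -tolerance_pp:
        return "regression"
    if delta > tolerance_pp:
        return "improvement"
    return "within noise"
```

Treating a delta of exactly ±1.0 pp as noise is a choice made here, since the text does not say whether the boundary is inclusive.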

Test suite

The parity assertions live in `tests/test_core_port.py`, `tests/test_compaction_port.py`, `tests/test_cortex_port.py`, and `tests/test_audit*.py`. They run under pytest with no external dependencies (no LLM, no embedder).