# Benchmarks
semvec ships with seven benchmark harnesses (six live-LLM and one offline replay), plus a set of unit-level regression tests that cover the same parity paths without network calls.
## Overview
| Harness | Kind | What it measures | Runtime |
|---|---|---|---|
| `run_longbench.py` | Live LLM | LongBench-v2 token savings vs growing history | ~10 min |
| `run_mtbench.py` | Live LLM | MT-Bench (80 × 2 turns) per-conversation cost | ~10 min |
| `run_core_state_llm.py` | Live LLM | 20-turn chat: core metric deltas per turn | ~40 s |
| `run_cortex_llm.py` | Live LLM | 3-agent SemvecAgentNetwork coherence deltas | ~40 s |
| `run_consensus_llm.py` | Live LLM | Consensus across 5 levels × 5 topics | ~1 min |
| `run_longmemeval_parity.py` | Live LLM | Per-entry pss vs semvec on LongMemEval-S | ~20 s/entry |
| `run_coding_replay.py` | Offline | 30-turn compaction byte-parity | ~10 s |
For the complete reproduction guide, including `.env` setup and Quick-Commands, see Running benchmarks. For the drift envelope that every semvec release must stay inside, see Parity envelope.
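For a quick smoke test before setting up API keys, the offline replay harness is the cheapest entry point: it runs in about 10 seconds and makes no network calls. Below is a minimal sketch of wiring it into a script; the `benchmarks/` script path and the non-zero-exit-on-failure convention are assumptions for illustration, not documented behaviour.

```python
import subprocess
import sys

# Hypothetical smoke test: run the offline replay harness (~10 s, no API key).
# The script path and exit-code convention are assumptions, not documented API.
result = subprocess.run(
    [sys.executable, "benchmarks/run_coding_replay.py"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    print("byte-parity check failed:", result.stderr, file=sys.stderr)
    sys.exit(1)
print("30-turn compaction byte-parity: OK")
```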
## Official LongMemEval CLI
The LongMemEval harness is additionally exposed as a module-level CLI:
```bash
python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --output results/semvec_full.json
```
See Benchmarks API → CLI for the full flag table.
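The `--output` flag writes a JSON results file. A minimal post-processing sketch is shown below; the schema it reads (a top-level list of per-entry records with a `tokens_saved` field) is an assumption for illustration, so adjust it to the actual file layout.

```python
import json
from pathlib import Path

# Load the results file written by --output. The schema below is an
# assumption: a list of per-entry dicts with a hypothetical
# "tokens_saved" field; adapt it to the real output format.
entries = json.loads(Path("results/semvec_full.json").read_text())
saved = [e.get("tokens_saved", 0) for e in entries]
print(f"{len(entries)} entries, mean tokens saved: {sum(saved) / max(len(saved), 1):.1f}")
```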
## Regression tests (free to run)
These tests mirror the live benchmarks structurally but use mocked LLMs and a deterministic test embedder. They run as part of `pytest tests/`:
| Test file | Tests |
|---|---|
| `tests/test_cortex_parity.py` | 26 |
| `tests/test_coding_parity.py` | 16 |
| `tests/test_coding_engine_replay_parity.py` | 9 |
| `tests/test_longmemeval_ingest_parity.py` | 5 |
| `tests/test_longmemeval_module.py` | 11 |
| `tests/test_retrieval_projection_injection.py` | 7 |
| `tests/test_coding_no_fallback.py` | 6 |
| `tests/test_token_reduction_clients.py` | 20 |
| `tests/test_coding_mcp_and_hooks.py` | 19 |
All of these must stay green before a release; they cost nothing to run and catch behavioural regressions within seconds.
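For orientation, the mocked-LLM pattern these tests rely on looks roughly like the sketch below: a stub LLM that returns canned replies and records its prompts, plus a hash-based embedder so identical inputs always yield identical vectors. Every name here (`StubLLM`, `det_embed`, and the asserted behaviour) is hypothetical and illustrative, not taken from the actual test suite.

```python
import hashlib

class StubLLM:
    """Hypothetical mocked LLM: returns a fixed reply and records prompts."""

    def __init__(self, reply: str = "ok") -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


def det_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical deterministic embedder: hash-derived, no model download."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]


def test_embedder_is_deterministic():
    # Identical input must map to identical vectors, so parity
    # assertions reproduce across runs and machines.
    assert det_embed("same input") == det_embed("same input")


def test_stub_llm_never_touches_the_network():
    llm = StubLLM(reply="summary")
    assert llm.complete("compact this history") == "summary"
    assert llm.prompts == ["compact this history"]
```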