Benchmarks

semvec ships with seven benchmark harnesses (six that call a live LLM and one offline replay), plus a set of unit-level regression tests that cover the same parity paths without network calls.

Overview

| Harness | Kind | What it measures | Runtime |
| --- | --- | --- | --- |
| `run_longbench.py` | Live LLM | LongBench-v2 token savings vs growing history | ~10 min |
| `run_mtbench.py` | Live LLM | MT-Bench (80 × 2 turns) per-conversation cost | ~10 min |
| `run_core_state_llm.py` | Live LLM | 20-turn chat: core metric deltas per turn | ~40 s |
| `run_cortex_llm.py` | Live LLM | 3-agent SemvecAgentNetwork coherence deltas | ~40 s |
| `run_consensus_llm.py` | Live LLM | Consensus across 5 levels × 5 topics | ~1 min |
| `run_longmemeval_parity.py` | Live LLM | Per-entry pss vs semvec on LongMemEval-S | ~20 s/entry |
| `run_coding_replay.py` | Offline | 30-turn compaction byte-parity | ~10 s |

For the complete reproduction guide including .env setup and Quick-Commands, see Running benchmarks. For the drift envelope that every semvec release must stay inside, see Parity envelope.

Official LongMemEval CLI

The LongMemEval harness is additionally exposed as a module-level CLI:

```bash
python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --output results/semvec_full.json
```

See Benchmarks API → CLI for the full flag table.
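For scripted runs, the same invocation can be assembled in Python and handed to `subprocess`. The following is a minimal sketch, not part of semvec: the helper name `build_longmemeval_command` and its parameters are hypothetical, and it uses only the flags shown in the command above. Which other values each flag accepts is not confirmed here; consult the flag table for that.

```python
import sys


def build_longmemeval_command(output_path, variant="S", temperature=0.0,
                              embed_device="cuda", multi_pss=True):
    """Build the argv list for the LongMemEval CLI shown above.

    Hypothetical helper: it only mirrors the flags from the documented
    command and does not validate their values.
    """
    cmd = [sys.executable, "-m", "semvec.benchmarks.longmemeval",
           "--variant", variant]
    if multi_pss:
        # --multi-pss is a bare switch in the documented command.
        cmd.append("--multi-pss")
    cmd += [
        "--temperature", str(temperature),
        "--embed-device", embed_device,
        "--output", str(output_path),
    ]
    return cmd


# To launch the benchmark (requires semvec to be installed and configured):
#   import subprocess
#   subprocess.run(build_longmemeval_command("results/semvec_full.json"), check=True)
```

Passing an argument list rather than a shell string avoids quoting problems when the output path contains spaces.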

Regression tests (free to run)

These tests mirror the live benchmarks structurally but use mocked LLMs and a deterministic test embedder. They run as part of `pytest tests/`:

| Test file | Number of tests |
| --- | --- |
| `tests/test_cortex_parity.py` | 26 |
| `tests/test_coding_parity.py` | 16 |
| `tests/test_coding_engine_replay_parity.py` | 9 |
| `tests/test_longmemeval_ingest_parity.py` | 5 |
| `tests/test_longmemeval_module.py` | 11 |
| `tests/test_retrieval_projection_injection.py` | 7 |
| `tests/test_coding_no_fallback.py` | 6 |
| `tests/test_token_reduction_clients.py` | 20 |
| `tests/test_coding_mcp_and_hooks.py` | 19 |

All of these must stay green before a release; they cost nothing to run and catch behavioural regressions within seconds.
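To illustrate why such tests need no network access, here is a minimal sketch of the general pattern: a hash-based deterministic embedder combined with a mocked LLM callable. It is not semvec's actual test fixture. The names `deterministic_embed` and `answer_with_context` are hypothetical, and the real test embedder may work quite differently.

```python
import hashlib
import math
from unittest.mock import Mock


def deterministic_embed(text, dim=8):
    """Map text to a unit-length vector derived from its SHA-256 digest.

    Hypothetical stand-in for a test embedder: the same text always
    yields the same vector, with no model download or network call.
    Supports dim <= 32 because a SHA-256 digest is 32 bytes long.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Centre byte values around zero; 127.5 guarantees no component is 0.
    raw = [digest[i] - 127.5 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def answer_with_context(question, llm, embed):
    """Toy function under test: embeds the question, then calls the LLM."""
    vector = embed(question)
    return llm(f"{question} [embedding dim={len(vector)}]")


# A Mock replaces the live LLM and returns a canned reply.
fake_llm = Mock(return_value="stub answer")
reply = answer_with_context("What changed?", fake_llm, deterministic_embed)
```

Because both the embedder and the LLM are deterministic, a test can assert on exact values and on how the LLM was called, for example with `fake_llm.assert_called_once()`.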