
Benchmarks

semvec ships with seven benchmark harnesses (six live-LLM, one offline replay), plus a set of unit-level regression tests that cover the same parity paths without network calls.

Overview

| Harness | Kind | What it measures | Runtime |
|---|---|---|---|
| `run_longbench.py` | Live LLM | LongBench-v2 token savings vs growing history | ~10 min |
| `run_mtbench.py` | Live LLM | MT-Bench (80 × 2 turns) per-conversation cost | ~10 min |
| `run_core_state_llm.py` | Live LLM | 20-turn chat: core metric deltas per turn | ~40 s |
| `run_cortex_llm.py` | Live LLM | 3-agent SemvecAgentNetwork coherence deltas | ~40 s |
| `run_consensus_llm.py` | Live LLM | Consensus across 5 levels × 5 topics | ~1 min |
| `run_longmemeval_parity.py` | Live LLM | Per-entry `pss` vs `semvec` on LongMemEval-S | ~20 s/entry |
| `run_coding_replay.py` | Offline | 30-turn compaction byte-parity | ~10 s |
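Each harness is a standalone script. As a rough sketch (the `benchmarks/` path below is an assumption about the repo layout; Running benchmarks has the exact commands):

```bash
# Live harness: needs API credentials from .env, makes ~10 min of LLM calls
python benchmarks/run_longbench.py

# Offline harness: no network calls, finishes in ~10 s
python benchmarks/run_coding_replay.py
```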

For the complete reproduction guide, including .env setup and quick commands, see Running benchmarks. For the drift envelope that every semvec release must stay inside, see Parity envelope.
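A minimal .env sketch for the live harnesses; the variable name below is an assumption for illustration, and Running benchmarks has the authoritative list:

```bash
# Illustrative only — the key name is an assumption; see "Running benchmarks"
# for the variables the harnesses actually read.
OPENAI_API_KEY=...   # credentials for the live-LLM harnesses
```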

Official LongMemEval CLI

The LongMemEval harness is additionally exposed as a module-level CLI:

```bash
python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --output results/semvec_full.json
```

See Benchmarks API → CLI for the full flag table.
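The run writes a JSON report to the path given by --output. One quick way to eyeball its top-level structure without extra tooling (the report schema itself is documented under Benchmarks API):

```bash
# Pretty-print the first lines of the results file written above
python -m json.tool results/semvec_full.json | head -n 20
```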

Regression tests (free to run)

These tests mirror the live benchmarks structurally but use mocked LLMs and a deterministic test embedder. They run as part of `pytest tests/` (invocation sketch after the table):

| Test file | Tests |
|---|---|
| `tests/test_cortex_parity.py` | 26 |
| `tests/test_coding_parity.py` | 16 |
| `tests/test_coding_engine_replay_parity.py` | 9 |
| `tests/test_longmemeval_ingest_parity.py` | 5 |
| `tests/test_longmemeval_module.py` | 11 |
| `tests/test_retrieval_projection_injection.py` | 7 |
| `tests/test_coding_no_fallback.py` | 6 |
| `tests/test_token_reduction_clients.py` | 20 |
| `tests/test_coding_mcp_and_hooks.py` | 19 |
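A sketch of the two usual invocations, assuming a standard pytest setup:

```bash
# Full regression suite: mocked LLMs and a deterministic embedder, so no
# API keys or network access are required
pytest tests/

# A single parity mirror in isolation, with quiet output
pytest tests/test_cortex_parity.py -q
```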

All of these must stay green before a release; they cost nothing to run and catch behavioural regressions within seconds.