Running benchmarks¶
All live-LLM harnesses read credentials from a .env file at the repository root. Create it once before running any benchmark.
.env¶
OPENAI_BASE_URL=https://your-endpoint/v1
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-oss-120b
# Optional separate judge endpoint — falls back to OPENAI_* if unset.
JUDGE_OPENAI_BASE_URL=https://your-judge-endpoint/v1
JUDGE_OPENAI_MODEL=gpt-oss-120b
JUDGE_OPENAI_API_KEY=sk-...
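To sanity-check the credentials before kicking off a long run, you can source the file and query the endpoint directly. This is a minimal sketch that assumes the endpoint is OpenAI-compatible and serves the standard /models route:
# export the .env values into the current shell, then ping the endpoint once
set -a && source .env && set +a
curl -sf "$OPENAI_BASE_URL/models" -H "Authorization: Bearer $OPENAI_API_KEY"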
Embedder¶
Every runner requires a real embedder. Install the sentence-transformers package into the project venv, for example:
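# assumes the venv lives at .venv/, matching the .venv/bin/python commands below
.venv/bin/pip install sentence-transformers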
The default model is all-MiniLM-L6-v2 (matches the pss reference). Override with --embed-model / --embed-device where accepted.
Quick commands¶
LongMemEval official CLI¶
.venv/bin/python -m semvec.benchmarks.longmemeval \
--variant S --multi-pss --temperature 0.0 \
--embed-device cuda \
--output results/semvec_full.json
Balanced 60-entry comparison run:
.venv/bin/python -m semvec.benchmarks.longmemeval \
--variant S --multi-pss --per-type 10 \
--temperature 0.2 --n-judges 3 \
--embed-device cuda \
--output results/semvec_pertype10.json
Resume a crashed run (skip the first 180 entries):
.venv/bin/python -m semvec.benchmarks.longmemeval \
--variant S --multi-pss --temperature 0.0 \
--skip-entries 180 \
--output results/semvec_full_part2.json
Side-by-side pss vs semvec (LongMemEval)¶
.venv/bin/python benchmarks/run_longmemeval_parity.py \
--entries 30 \
--output benchmarks/results/longmemeval_parity_30.json
Cortex multi-agent parity¶
.venv/bin/python benchmarks/run_cortex_llm.py \
--turns 15 \
--output benchmarks/results/cortex_llm_parity.json
Consensus voting parity (5 topics × 5 levels)¶
.venv/bin/python benchmarks/run_consensus_llm.py \
--rounds 5 \
--output benchmarks/results/consensus_llm_parity.json
Core-state per-turn deltas on a 20-turn Q&A¶
.venv/bin/python benchmarks/run_core_state_llm.py \
--turns 20 \
--output benchmarks/results/core_state_llm_parity.json
Full LongBench-v2 run (503 questions)¶
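A sketch of a likely invocation, assuming this runner follows the same pattern as the other parity scripts; the script name and flags are assumptions, so check benchmarks/ for the actual entry point:
# hypothetical invocation, adjust the script name to the real entry point
.venv/bin/python benchmarks/run_longbench_v2_llm.py \
  --output benchmarks/results/longbench_v2_llm.json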
MT-Bench (80 × 2 turns)¶
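Similarly, a sketch with an assumed script name:
# hypothetical invocation, adjust the script name to the real entry point
.venv/bin/python benchmarks/run_mtbench_llm.py \
  --output benchmarks/results/mtbench_llm.json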
Coding compaction replay (offline, 30/30 byte-identical)¶
.venv/bin/python benchmarks/run_coding_replay.py \
--prompts threejs \
--output benchmarks/results/coding_replay_threejs.json
.venv/bin/python benchmarks/run_coding_replay.py \
--prompts multifile \
--output benchmarks/results/coding_replay_multifile.json
Result file layout¶
All benchmarks write a JSON file with two top-level keys.
The per-entry records carry pss_* and sv_* mirrored fields so the two implementations can be diffed directly.
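To inspect the layout and the mirrored fields directly, a quick pass with jq works, assuming jq is installed; the per-entry key name used in the second command is an assumption, so substitute whatever the first command reports:
# show the two top-level keys
jq 'keys' benchmarks/results/longmemeval_parity_30.json
# dump the mirrored pss_*/sv_* fields of the first per-entry record
jq '.entries[0] | with_entries(select(.key | startswith("pss_") or startswith("sv_")))' \
  benchmarks/results/longmemeval_parity_30.json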