
Running benchmarks

All live-LLM harnesses read credentials from a .env file at the repository root. Create it once before running anything.

.env

OPENAI_BASE_URL=https://your-endpoint/v1
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-oss-120b

# Optional separate judge endpoint — falls back to OPENAI_* if unset.
JUDGE_OPENAI_BASE_URL=https://your-judge-endpoint/v1
JUDGE_OPENAI_MODEL=gpt-oss-120b
JUDGE_OPENAI_API_KEY=sk-...
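
A quick way to confirm the credentials resolve before a long run is to load the file and check the required keys. This is only a sketch and assumes python-dotenv is installed; the harnesses' own loading code may differ.

# check_env.py: sanity-check the .env before launching a benchmark.
# Assumes python-dotenv; the harnesses may load credentials differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

required = ["OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL"]
missing = [k for k in required if not os.getenv(k)]
if missing:
    raise SystemExit(f"missing keys in .env: {', '.join(missing)}")

# The judge endpoint is optional and falls back to the OPENAI_* values.
judge_base = os.getenv("JUDGE_OPENAI_BASE_URL", os.getenv("OPENAI_BASE_URL"))
print("answer endpoint:", os.getenv("OPENAI_BASE_URL"))
print("judge endpoint: ", judge_base)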

Embedder

Every runner requires a real embedder. Install sentence-transformers into the project venv:

.venv/bin/pip install sentence-transformers

The default model is all-MiniLM-L6-v2 (matching the pss reference). Override the model with --embed-model and the device with --embed-device where those flags are accepted.
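
A quick smoke test that the embedder is usable (this uses the standard sentence-transformers API with the default model named above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # default; override via --embed-model
vectors = model.encode(["hello world", "running benchmarks"])
print(vectors.shape)  # (2, 384): MiniLM-L6-v2 produces 384-dimensional embeddings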

Quick commands

LongMemEval official CLI

.venv/bin/python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --output results/semvec_full.json

Balanced 60-entry comparison run:

.venv/bin/python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --per-type 10 \
    --temperature 0.2 --n-judges 3 \
    --embed-device cuda \
    --output results/semvec_pertype10.json

Resume a crashed run (skip the first 180 entries):

.venv/bin/python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --skip-entries 180 \
    --output results/semvec_full_part2.json
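
Because both files follow the summary/records layout described under "Result file layout" below, the partial outputs can be stitched back together with a short script. This is only a sketch: the merged filename is arbitrary and the summary fields are benchmark-specific, so recompute or drop them downstream.

# merge_parts.py: join a resumed run onto the original partial output.
# Sketch only; summary statistics should be recomputed after the merge.
import json

with open("results/semvec_full.json") as f:
    part1 = json.load(f)
with open("results/semvec_full_part2.json") as f:
    part2 = json.load(f)

merged = {
    "summary": {"note": "recompute after merge"},
    "records": part1["records"] + part2["records"],
}
with open("results/semvec_full_merged.json", "w") as f:  # hypothetical output name
    json.dump(merged, f, indent=2)
print(len(merged["records"]), "records merged")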

Side-by-side pss vs semvec (LongMemEval)

.venv/bin/python benchmarks/run_longmemeval_parity.py \
    --entries 30 \
    --output benchmarks/results/longmemeval_parity_30.json

Cortex multi-agent parity

.venv/bin/python benchmarks/run_cortex_llm.py \
    --turns 15 \
    --output benchmarks/results/cortex_llm_parity.json

Consensus voting parity (5 topics × 5 levels)

.venv/bin/python benchmarks/run_consensus_llm.py \
    --rounds 5 \
    --output benchmarks/results/consensus_llm_parity.json

Core-state per-turn deltas on a 20-turn Q&A

.venv/bin/python benchmarks/run_core_state_llm.py \
    --turns 20 \
    --output benchmarks/results/core_state_llm_parity.json

Full LongBench-v2 run (503 questions)

.venv/bin/python benchmarks/run_longbench.py \
    --output benchmarks/results/longbench_v2_real.json

MT-Bench (80 × 2 turns)

.venv/bin/python benchmarks/run_mtbench.py \
    --output benchmarks/results/mt_bench_full.json

Coding compaction replay (offline, 30/30 byte-identical)

.venv/bin/python benchmarks/run_coding_replay.py \
    --prompts threejs \
    --output benchmarks/results/coding_replay_threejs.json

.venv/bin/python benchmarks/run_coding_replay.py \
    --prompts multifile \
    --output benchmarks/results/coding_replay_multifile.json

Result file layout

All benchmarks write a JSON file with two top-level keys:

{
  "summary": { "...": "..." },
  "records": [ { "...per-entry or per-turn fields...": "..." } ]
}

The per-entry records carry mirrored pss_* and sv_* fields so the two implementations can be diffed directly.
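
For example, a record-level diff can pair up whatever mirrored prefixes are present. This is a sketch assuming only the layout above; the field names beyond the pss_/sv_ prefixes vary per benchmark.

# diff_records.py: compare mirrored pss_* / sv_* fields in one result file.
import json, sys

with open(sys.argv[1]) as f:
    data = json.load(f)

for i, rec in enumerate(data["records"]):
    for key, pss_val in rec.items():
        if not key.startswith("pss_"):
            continue
        sv_key = "sv_" + key[len("pss_"):]
        if sv_key in rec and rec[sv_key] != pss_val:
            print(f"record {i}: {key}={pss_val!r} vs {sv_key}={rec[sv_key]!r}")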