Running benchmarks¶
The LOCOMO bench runners live under benchmarks/ in the repo. They are
not executed by pytest — they require a real OpenAI-compatible LLM
endpoint and the SentenceTransformer embedder.
.env¶
At the repo root:
OPENAI_BASE_URL=https://your-endpoint/v1
OPENAI_API_KEY=sk-...
OPENAI_MODEL=openai/gpt-4o
# Optional separate judge endpoint — falls back to OPENAI_* if unset.
JUDGE_OPENAI_BASE_URL=https://your-judge-endpoint/v1
JUDGE_OPENAI_MODEL=openai/gpt-4o-mini
JUDGE_OPENAI_API_KEY=sk-...
# semvec API license token (for /v1/run, /v1/store)
SEMVEC_LICENSE_KEY=eyJ...
Install¶
The hybrid extra pulls bm25s + nltk (Porter stemmer) — required to
reproduce the +2.6 pp BM25-hybrid number. The api extra pulls FastAPI +
uvicorn so semvec serve works.
Optional mem0 SDK for head-to-head comparison:
Server config — current best¶
set -a && . ./.env && set +a
SEMVEC_RUN_TOP_K=15 \
SEMVEC_CONTEXT_BUDGET_CHARS=10000 \
SEMVEC_RERANK_MODEL="cross-encoder/ms-marco-MiniLM-L-6-v2" \
SEMVEC_RERANK_FETCH_K=50 \
SEMVEC_RERANK_BATCH=64 \
SEMVEC_HYBRID_BM25=1 \
SEMVEC_BM25_FETCH_K=50 \
semvec serve --host 127.0.0.1 --port 8080 --log-level info &
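Because the command above backgrounds the server with `&`, a benchmark launched right afterwards can race the server start-up. The following is a minimal sketch of a readiness check. It assumes only that the server accepts TCP connections on the host and port passed to `semvec serve`. `wait_for_port` is a hypothetical helper written for this page, not part of semvec.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry until the deadline.
            time.sleep(interval_s)
    return False
```

Calling `wait_for_port("127.0.0.1", 8080)` before starting `benchmarks/run_locomo.py` would block until the port opens. An open port does not prove the models have finished loading, so treat it as a lower bound on readiness.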
LOCOMO runs¶
Single-convo smoke (~6 min, ~$0.50)¶
.venv/bin/python -u benchmarks/run_locomo.py --conv-id 0 \
-o "benchmarks/results/locomo_conv26_$(date +%Y%m%d_%H%M%S).json" \
2>&1 | tee /tmp/locomo.log
Full 10-convo suite (~60 min, ~$6, 1986 QAs)¶
.venv/bin/python -u benchmarks/run_locomo.py --conv-id -1 \
-o "benchmarks/results/locomo_FULL_$(date +%Y%m%d_%H%M%S).json" \
2>&1 | tee /tmp/locomo.log
LLM-as-Judge re-evaluation¶
The runner can re-score predictions with the mem0 paper's judge prompt (verbatim). Use this to make semvec numbers directly comparable to mem0 / Zep / Letta numbers reported in the LOCOMO literature.
Re-judge a finished run in place (most common — adds a judge column to an existing predictions file):
.venv/bin/python -u benchmarks/run_locomo.py --judge \
-o benchmarks/results/locomo_FULL_20260513_091500.json \
--judge-model openai/gpt-4o-mini
Re-judge with the stand-alone helper (when you want a separate output file):
.venv/bin/python -u benchmarks/run_locomo_judge.py \
-i benchmarks/results/locomo_FULL_20260513_091500.json \
-o "benchmarks/results/locomo_judge_$(date +%Y%m%d_%H%M%S).json" \
--judge-model openai/gpt-4o-mini
The judge resolves credentials via JUDGE_OPENAI_* env vars (falls back
to OPENAI_* if the judge variables are unset). The OpenAI-compatible
adapter is requests-backed — no openai SDK dependency is required, so
pip install "semvec[benchmarks]" is sufficient to run both the bench
and the judge. .env lookup uses find_dotenv() and walks up from the
runner's CWD, so the judge works cleanly from worktrees and
sub-directories, not just the repo root.
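The fallback rule described above can be sketched as follows. This is an illustration, not the runner's actual code: it assumes the fallback is applied per variable (the text does not say whether it is per variable or all-or-nothing) and that an empty value counts as unset. `resolve_judge_setting` and `resolve_judge_config` are hypothetical names.

```python
import os
from typing import Mapping, Optional


def resolve_judge_setting(name: str,
                          env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Look up JUDGE_OPENAI_<name>, falling back to OPENAI_<name>."""
    env = os.environ if env is None else env
    # Assumption of this sketch: an empty string is treated like an unset variable.
    return env.get(f"JUDGE_OPENAI_{name}") or env.get(f"OPENAI_{name}")


def resolve_judge_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve the three judge settings named in the .env section."""
    return {key.lower(): resolve_judge_setting(key, env)
            for key in ("BASE_URL", "API_KEY", "MODEL")}
```

With only `JUDGE_OPENAI_MODEL` set alongside the `OPENAI_*` variables, this sketch would send judge calls to the main endpoint with the main key but with the judge model.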
Reading results¶
The result JSON contains per-QA entries with question, gold, pred,
category, and f1. A typical aggregator:
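import glob
import json
# open() does not expand wildcards; take the newest matching file
# (the timestamped names sort chronologically).
path = sorted(glob.glob("benchmarks/results/locomo_FULL_*.json"))[-1]
with open(path) as fh:
    d = json.load(fh)
for c in d["convos"]:
    print(c["sample_id"], c["overall_f1"], c["n_qa"])
weighted = sum(c["overall_f1"] * c["n_qa"] for c in d["convos"]) \
    / sum(c["n_qa"] for c in d["convos"])
print(f"Weighted F1: {weighted:.4f}")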
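The per-QA fields mentioned above (category, f1) also allow a per-category breakdown. The sketch below takes any iterable of per-QA dicts. Where those entries are nested inside the result file is not stated here, so the caller has to pass the right list. `f1_by_category` is a hypothetical helper.

```python
from collections import defaultdict


def f1_by_category(qa_entries):
    """Return {category: (mean_f1, count)} for an iterable of per-QA dicts."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of f1, number of QAs]
    for qa in qa_entries:
        bucket = totals[qa["category"]]
        bucket[0] += qa["f1"]
        bucket[1] += 1
    return {cat: (f1_sum / n, n) for cat, (f1_sum, n) in totals.items()}
```

Reporting the count next to each mean helps when categories are unevenly sized, since a small category can swing widely between runs.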
Reproducibility checklist¶
1. Fresh session per conversation.
semvec.create_session() is not a top-level helper in the installed wheel; the bench driver uses the REST surface against a running semvec serve:
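import os
import uuid

import httpx

# The license token is read from the environment (see the .env section).
SEMVEC_LICENSE_KEY = os.environ["SEMVEC_LICENSE_KEY"]
client = httpx.Client(
    base_url="http://127.0.0.1:8080/v1",
    headers={"X-API-Key": SEMVEC_LICENSE_KEY},
    timeout=httpx.Timeout(60.0),
)
sid = str(uuid.uuid4())  # one fresh session id per conversation
try:
    client.post("/session/create", json={"session_id": sid})
    # ... replay turns: client.post("/store", json={"session_id": sid, ...})
    # ... evaluate: client.post("/run", json={"session_id": sid, ...})
finally:
    # Always drop the session so nothing leaks into the next conversation.
    client.delete(f"/session/{sid}")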
The LOCOMO driver scripts (benchmarks/run_locomo.py,
benchmarks/run_locomo_judge.py) live in the semvec source repo
under benchmarks/, not in the PyPI wheel. Clone the repo to
run them; pip install "semvec[benchmarks]" only pulls the runtime
dependencies they need (httpx, datasets, scikit-learn).
There must be no cross-conversation memory carry-over.
2. temperature=0 on every LLM call. Note: with gpt-4o via OpenRouter
there is still ~40% per-QA stochasticity from provider routing; on the
aggregate this nets out to a drift within ±0.5 pp.
3. Same embedder for replay and queries (paraphrase-multilingual-mpnet-base-v2,
768d).
4. Hybrid env vars pinned as above. Default-off — set
SEMVEC_HYBRID_BM25=1 explicitly.
5. Cross-encoder model pinned (ms-marco-MiniLM-L-6-v2).
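The aggregate drift mentioned in item 2 can be checked by comparing two result files from otherwise identical runs. The sketch below reuses the convos / overall_f1 / n_qa fields from the aggregator in "Reading results". It assumes overall_f1 is stored as a fraction in [0, 1], which the source does not confirm. `weighted_f1` and `drift_pp` are hypothetical helpers.

```python
def weighted_f1(convos):
    """QA-count-weighted mean of the per-conversation F1 scores."""
    total_qa = sum(c["n_qa"] for c in convos)
    return sum(c["overall_f1"] * c["n_qa"] for c in convos) / total_qa


def drift_pp(run_a, run_b):
    """Weighted-F1 difference (run_b minus run_a) in percentage points.

    Assumes overall_f1 is a fraction in [0, 1]; drop the factor of 100
    if the file already stores percentages.
    """
    return 100.0 * (weighted_f1(run_b["convos"]) - weighted_f1(run_a["convos"]))
```

Here `run_a` and `run_b` are the parsed JSON dicts of two runs. A difference well outside the stated ±0.5 pp would suggest that something other than provider routing changed between them.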
Wall-clock budget for the full suite: ~60 minutes, ~$6 of gpt-4o
OpenRouter credit. The replay phase dominates (~3 minutes per
500-turn convo).