Running benchmarks

The LOCOMO bench runners live under benchmarks/ in the repo. They are not executed by pytest — they require a real OpenAI-compatible LLM endpoint and the SentenceTransformer embedder.

.env

At the repo root:

OPENAI_BASE_URL=https://your-endpoint/v1
OPENAI_API_KEY=sk-...
OPENAI_MODEL=openai/gpt-4o

# Optional separate judge endpoint — falls back to OPENAI_* if unset.
JUDGE_OPENAI_BASE_URL=https://your-judge-endpoint/v1
JUDGE_OPENAI_MODEL=openai/gpt-4o-mini
JUDGE_OPENAI_API_KEY=sk-...

# semvec API license token (for /v1/run, /v1/store)
SEMVEC_LICENSE_KEY=eyJ...

Install

pip install "semvec[benchmarks,hybrid,api]"

The hybrid extra pulls bm25s + nltk (Porter stemmer) — required to reproduce the +2.6 pp BM25-hybrid number. The api extra pulls FastAPI + uvicorn so semvec serve works.

Optional mem0 SDK for head-to-head comparison:

pip install "semvec[mem0]"

Server config — current best

set -a && . ./.env && set +a
SEMVEC_RUN_TOP_K=15 \
SEMVEC_CONTEXT_BUDGET_CHARS=10000 \
SEMVEC_RERANK_MODEL="cross-encoder/ms-marco-MiniLM-L-6-v2" \
SEMVEC_RERANK_FETCH_K=50 \
SEMVEC_RERANK_BATCH=64 \
SEMVEC_HYBRID_BM25=1 \
SEMVEC_BM25_FETCH_K=50 \
semvec serve --host 127.0.0.1 --port 8080 --log-level info &
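Because the server is started in the background with &, a runner launched immediately afterwards may reach a port that is not listening yet. The following is a minimal sketch of a readiness wait that uses only a plain TCP connect. It assumes nothing about semvec's HTTP routes; wait_for_port is a hypothetical helper, not part of semvec.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Usage, with host and port taken from the `semvec serve` command above:
#     if not wait_for_port("127.0.0.1", 8080):
#         raise SystemExit("semvec serve did not start listening in time")
```

An open port only shows that the process is accepting connections; the embedder and reranker models may still be loading, so the first request can be slow.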

LOCOMO runs

Single-convo smoke (~6 min, ~$0.50)

.venv/bin/python -u benchmarks/run_locomo.py --conv-id 0 \
  -o "benchmarks/results/locomo_conv26_$(date +%Y%m%d_%H%M%S).json" \
  2>&1 | tee /tmp/locomo.log

Full 10-convo suite (~60 min, ~$6, 1986 QAs)

.venv/bin/python -u benchmarks/run_locomo.py --conv-id -1 \
  -o "benchmarks/results/locomo_FULL_$(date +%Y%m%d_%H%M%S).json" \
  2>&1 | tee /tmp/locomo.log

LLM-as-Judge re-evaluation

The runner can re-score predictions with the mem0 paper's judge prompt (verbatim). Use this to make semvec numbers directly comparable to mem0 / Zep / Letta numbers reported in the LOCOMO literature.

Re-judge a finished run in place (most common — adds a judge column to an existing predictions file):

.venv/bin/python -u benchmarks/run_locomo.py --judge \
  -o benchmarks/results/locomo_FULL_20260513_091500.json \
  --judge-model openai/gpt-4o-mini

Re-judge with the stand-alone helper (when you want a separate output file):

.venv/bin/python -u benchmarks/run_locomo_judge.py \
  -i benchmarks/results/locomo_FULL_20260513_091500.json \
  -o "benchmarks/results/locomo_judge_$(date +%Y%m%d_%H%M%S).json" \
  --judge-model openai/gpt-4o-mini

The judge resolves credentials via JUDGE_OPENAI_* env vars (falls back to OPENAI_* if the judge variables are unset). The OpenAI-compatible adapter is requests-backed — no openai SDK dependency is required, so pip install "semvec[benchmarks]" is sufficient to run both the bench and the judge. .env lookup uses find_dotenv() and walks up from the runner's CWD, so the judge works cleanly from worktrees and sub-directories, not just the repo root.
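The fallback rule described above can be sketched as a pure function over an environment mapping. This is an illustration, not the runner's actual code: it assumes the fallback happens per variable and that an empty value counts as unset, neither of which is stated here. resolve_judge_setting and resolve_judge_config are hypothetical names.

```python
from typing import Dict, Mapping, Optional


def resolve_judge_setting(env: Mapping[str, str], suffix: str) -> Optional[str]:
    """Return JUDGE_OPENAI_<suffix> if set and non-empty, else OPENAI_<suffix>."""
    return env.get(f"JUDGE_OPENAI_{suffix}") or env.get(f"OPENAI_{suffix}")


def resolve_judge_config(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Resolve the judge's base URL, API key and model, one variable at a time."""
    return {
        name.lower(): resolve_judge_setting(env, name)
        for name in ("BASE_URL", "API_KEY", "MODEL")
    }


# Usage against the real environment:
#     import os
#     cfg = resolve_judge_config(os.environ)
# To load .env from a parent directory first, python-dotenv offers:
#     from dotenv import find_dotenv, load_dotenv
#     load_dotenv(find_dotenv(usecwd=True))
```

Taking the mapping as a parameter keeps the function testable with plain dicts instead of the process environment.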

Reading results

The result JSON contains per-QA entries with question, gold, pred, category, and f1. A typical aggregator:

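import glob
import json

# open() does not expand wildcards, so resolve the pattern first.
# The timestamped names sort chronologically; take the newest file.
path = sorted(glob.glob("benchmarks/results/locomo_FULL_*.json"))[-1]
with open(path) as f:
    d = json.load(f)
for c in d["convos"]:
    print(c["sample_id"], c["overall_f1"], c["n_qa"])
# Mean of per-conversation F1, weighted by each conversation's QA count.
weighted = sum(c["overall_f1"] * c["n_qa"] for c in d["convos"]) \
         / sum(c["n_qa"] for c in d["convos"])
print(f"Weighted F1: {weighted:.4f}")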
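Since each per-QA entry carries category and f1, a per-category breakdown is a natural companion to the weighted total. A minimal sketch follows. It takes any iterable of per-QA dicts, because where those entries are nested inside the result JSON is not specified here. f1_by_category is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping, Tuple


def f1_by_category(qa_entries: Iterable[Mapping]) -> Dict[Hashable, Tuple[float, int]]:
    """Map each category to (mean F1, number of QAs in that category)."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of f1, count]
    for qa in qa_entries:
        bucket = totals[qa["category"]]
        bucket[0] += float(qa["f1"])
        bucket[1] += 1
    return {cat: (total / n, n) for cat, (total, n) in totals.items()}


# Usage: collect the per-QA dicts from the loaded result file, then
#     for cat, (mean_f1, n) in sorted(f1_by_category(entries).items()):
#         print(cat, f"{mean_f1:.4f}", n)
```

Reporting the count next to each mean makes it easier to spot categories whose score rests on only a few questions.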

Reproducibility checklist

  1. Fresh session per conversation. semvec.create_session() is not a top-level helper in the installed wheel — the bench driver uses the REST surface against a running semvec serve:
# Requires a running `semvec serve` on :8080 and SEMVEC_LICENSE_KEY in the environment.
import os
import uuid

import httpx

client = httpx.Client(
    base_url="http://127.0.0.1:8080/v1",
    headers={"X-API-Key": os.environ["SEMVEC_LICENSE_KEY"]},
    timeout=httpx.Timeout(60.0),
)
sid = str(uuid.uuid4())  # new session ID for every conversation
try:
    client.post("/session/create", json={"session_id": sid})
    # ... replay turns: client.post("/store", json={"session_id": sid, ...})
    # ... evaluate:    client.post("/run",   json={"session_id": sid, ...})
finally:
    # Always drop the session so nothing leaks into the next conversation.
    client.delete(f"/session/{sid}")
    client.close()

     No cross-conversation memory carry-over.

     The LOCOMO driver scripts (benchmarks/run_locomo.py, benchmarks/run_locomo_judge.py) live in the semvec source repo under benchmarks/, not in the PyPI wheel. Clone the repo to run them; pip install "semvec[benchmarks]" only pulls the runtime dependencies they need (httpx, datasets, scikit-learn).
  2. temperature=0 on every LLM call. Note: with gpt-4o via OpenRouter there is still ~40% per-QA stochasticity from provider routing; this nets out to ≤ ±0.5 pp drift on the aggregate.
  3. Same embedder for replay and queries (paraphrase-multilingual-mpnet-base-v2, 768d).
  4. Hybrid env vars pinned as above. BM25 hybrid is off by default; set SEMVEC_HYBRID_BM25=1 explicitly.
  5. Cross-encoder model pinned (ms-marco-MiniLM-L-6-v2).
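The pinned settings in this checklist can be verified mechanically before a run. The sketch below compares an environment mapping against the values from the server config command. env_mismatches and PINNED are hypothetical names, and the check only sees the mapping it is given, so it is useful when the variables are exported rather than passed inline to semvec serve.

```python
from typing import Dict, List, Mapping

# Values copied from the "Server config" command in this document.
PINNED: Dict[str, str] = {
    "SEMVEC_RUN_TOP_K": "15",
    "SEMVEC_CONTEXT_BUDGET_CHARS": "10000",
    "SEMVEC_RERANK_MODEL": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "SEMVEC_RERANK_FETCH_K": "50",
    "SEMVEC_RERANK_BATCH": "64",
    "SEMVEC_HYBRID_BM25": "1",
    "SEMVEC_BM25_FETCH_K": "50",
}


def env_mismatches(env: Mapping[str, str],
                   expected: Mapping[str, str] = PINNED) -> List[str]:
    """Return one message per variable that is missing or differs from its pin."""
    problems = []
    for name, want in expected.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    return problems


# Usage:
#     import os
#     for line in env_mismatches(os.environ):
#         print("config drift:", line)
```

An empty list means every pinned variable matches; anything else is worth fixing before spending benchmark credit.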

Wall-clock budget for the full suite: ~60 minutes, ~$6 of gpt-4o OpenRouter credit. The replay phase dominates (~3 minutes per 500-turn convo).