Semvec benchmarks: LOCOMO
Semvec is benchmarked on the public LOCOMO long-term conversational memory suite (snap-research/locomo). mem0, Zep, Letta, and the GPT-4-turbo full-context baseline all report against the same suite, which allows direct head-to-head comparison.
TL;DR: Semvec's value proposition
| Axis | Semvec | mem0 (real, head-to-head measured) |
|---|---|---|
| Ingest LLM calls per turn | 0 | 1+ |
| Replay wall-time (LOCOMO conv-44, 675 turns) | ~3 min | ~24.5 min |
| Speedup at ingest | — (baseline) | ~8× slower |
| Context tokens per reader call | ~2,000 | ~2,000–5,000 |
| LOCOMO J non-adv (1540 QAs) | 0.605 | 0.669 (paper) / 0.675 (real, conv-44 only, n=123) |
Semvec trades ~6 pp J for ~8× faster ingest, zero LLM calls at ingest, and deterministic replay. On token cost per reader call Semvec is competitive or better; on wall-time, cost, and determinism at ingest Semvec wins structurally.
What LOCOMO measures
LOCOMO comprises 10 conversations of 369–689 turns each, with 105–260 question-answer pairs per conversation (1986 QAs in total), split into five question categories:
| Cat | Type | n | What it tests |
|---|---|---|---|
| 1 | single-hop | 282 | One memory contains the answer verbatim. |
| 2 | multi-hop | 321 | Answer requires combining two or more memories. |
| 3 | temporal | 96 | Date-resolution and relative-time reasoning. |
| 4 | open-domain | 841 | Free-form reasoning over the conversation. |
| 5 | adversarial | 446 | Questions whose answer is not in the conversation — the system must say "no information available" instead of guessing. |
Evaluation setup (mem0 1:1)
- Reader = Judge model: openai/gpt-4o-mini, T = 0
- Judge prompt: byte-identical to mem0ai/mem0/evaluation/metrics/llm_judge.py
- LLM-as-Judge accuracy (J): binary CORRECT / WRONG per QA, aggregated per category; Cat 5 excluded (mem0 / Zep / RAG headline convention)
- F1 (5 cats): Porter-stemmed token-F1, ported verbatim from snap-research/locomo/task_eval/evaluation.py
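The two metrics above can be sketched roughly as follows. This is a minimal illustration, not the ported LOCOMO or mem0 code: the function names `token_f1` and `judge_accuracy` are hypothetical, the tokenisation is a simple lowercase word split, and stemming is left as a pluggable `stem` callable (identity by default). Passing, for example, NLTK's `PorterStemmer().stem` would bring it closer to the stemmed variant the official eval uses, whose normalisation details may differ.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Tuple


def token_f1(prediction: str, reference: str,
             stem: Callable[[str], str] = lambda t: t) -> float:
    """Token-level F1 between a predicted and a reference answer.

    `stem` is an assumption of this sketch: identity by default,
    swap in a Porter stemmer to approximate the official metric.
    """
    pred = [stem(t) for t in re.findall(r"\w+", prediction.lower())]
    ref = [stem(t) for t in re.findall(r"\w+", reference.lower())]
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def judge_accuracy(verdicts: Iterable[Tuple[int, bool]],
                   exclude_categories: Tuple[int, ...] = (5,)) -> float:
    """Fraction of CORRECT verdicts, skipping excluded categories (Cat 5)."""
    kept = [ok for cat, ok in verdicts if cat not in exclude_categories]
    return sum(kept) / len(kept)
```

Excluding category 5 by default mirrors the headline convention described above; passing `exclude_categories=()` would score all five categories.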
Headline numbers
LLM-as-Judge accuracy (Cat 1-4, n=1540): mem0's headline metric
| Category | n | Semvec | mem0 paper | mem0 real (conv-44 only) |
|---|---|---|---|---|
| single-hop | 282 | 0.582 | 0.671 | 0.633 (conv-44 n=30) |
| multi-hop | 321 | 0.502 | 0.512 | 0.417 (conv-44 n=24) |
| temporal | 96 | 0.469 | 0.555 | 0.286 (conv-44 n=7) |
| open-domain | 841 | 0.667 | 0.729 | 0.839 (conv-44 n=62) |
| OVERALL J non-adv | 1540 | 0.605 | 0.669 | 0.675 (conv-44 n=123) |
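The "mem0 real" column comes from a live head-to-head run against
mem0ai==2.0.2 (locally installed, same gpt-4o-mini reader and judge, same
prompts) on a single conversation (conv-44, n=123 non-adversarial QAs).
Its overall J of 0.675 is within 0.6 pp of the mem0 paper's published
0.669, which suggests the published headline figure is reproducible.
The per-category values differ more, likely because the per-category
samples on one conversation are small (n=7 to n=62).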
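As a sanity check, the overall J values in the table can be recomputed as the n-weighted mean of the per-category accuracies. The sketch below uses only the figures from the table; the helper name `weighted_accuracy` is illustrative, and because the per-category inputs are already rounded to three decimals the result is approximate.

```python
def weighted_accuracy(rows):
    """n-weighted mean of per-category accuracies; rows = [(n, accuracy), ...]."""
    total = sum(n for n, _ in rows)
    return sum(n * acc for n, acc in rows) / total


# (n, J) per category: single-hop, multi-hop, temporal, open-domain
semvec = [(282, 0.582), (321, 0.502), (96, 0.469), (841, 0.667)]
mem0_conv44 = [(30, 0.633), (24, 0.417), (7, 0.286), (62, 0.839)]

semvec_j = weighted_accuracy(semvec)
mem0_conv44_j = weighted_accuracy(mem0_conv44)

print(f"Semvec overall J:      {semvec_j:.3f}")       # 0.605
print(f"mem0 conv-44 overall J: {mem0_conv44_j:.3f}")  # 0.675
```

Both recomputed values agree with the OVERALL row, and the category counts sum to 1540 and 123 respectively.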
Stemmed-F1 (all 5 cats, n=1986): official LOCOMO eval
| Category | n | Semvec |
|---|---|---|
| single-hop | 282 | 0.366 |
| multi-hop | 321 | 0.430 |
| temporal | 96 | 0.264 |
| open-domain | 841 | 0.497 |
| adversarial | 446 | 0.352 |
| OVERALL F1 | 1986 | 0.424 |
Position vs other memory systems (J non-adv, same reader+judge)
| Rank | System | J non-adv |
|---|---|---|
| 1 | GPT-4 Full-Ctx 128K | 0.726 |
| 2 | Mem0-graph | 0.683 |
| 3 | Mem0 | 0.669 |
| 4 | RAG (k=5) | 0.611 |
| 5 | Semvec | 0.605 |
| 6 | LangMem | 0.587 |
| 7 | Zep (corrected) | 0.584 |
| 8 | A-Mem | 0.524 |
| 9 | MemoryBank | 0.470 |
| 10 | Letta / MemGPT | 0.408 |
Cost class: Semvec is the only dedicated memory system that ingests without a generative LLM call
Of the ten LOCOMO contenders, only two others share this profile: GPT-4 Full-Ctx (no memory system at all — raw context stuffing) and RAG @ k=5 (document retrieval — a different problem class, complementary to Semvec rather than competitive: many users run both side by side). The remaining seven — Mem0, Mem0-graph, LangMem, Zep, A-Mem, MemoryBank, Letta/MemGPT — all run one or more generative LLM passes per stored turn (fact extraction, graph triples, atomic notes, hierarchical summarisation, memory paging).
Semvec lands at rank 5 on J and beats five of those seven (LangMem, Zep, A-Mem, MemoryBank, Letta) at a fraction of their ingest cost.
Pitch line: Mem0-near quality at zero generative-LLM cost at ingest.
Speed & cost: the structural advantage
Replay 675 turns of LOCOMO conv-44, then answer 158 QAs:
| Stage | Semvec | mem0 (real, head-to-head) |
|---|---|---|
| Replay (ingest) — 675 turns | ~3 min | ~24.5 min (~8× slower) |
| QA pass — 158 questions | ~2 min | ~3.5 min |
| Total | ~5 min | ~28 min (~5.5× slower end-to-end) |
The ingest gap is structural: mem0 runs a fact-extraction LLM call on
every add(); Semvec writes raw turns into the embedding store at zero
LLM cost. Extrapolated to the full 1986-QA suite, the absolute gap grows
from minutes to hours: Semvec's full run completes in ~95 minutes,
whereas mem0's would take an estimated ~6–8 hours.
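The slowdown factors quoted here follow directly from the approximate wall-times in the table. A small sketch of that arithmetic, using only the rounded figures above (so the ratios are approximate too):

```python
# Approximate wall-times in minutes, taken from the table and text above.
semvec_ingest, mem0_ingest = 3.0, 24.5
semvec_total, mem0_total = 5.0, 28.0
semvec_full = 95.0                                   # full 1986-QA suite, measured
mem0_full_low, mem0_full_high = 6 * 60.0, 8 * 60.0   # extrapolated 6-8 hours

ingest_ratio = mem0_ingest / semvec_ingest       # roughly 8x
total_ratio = mem0_total / semvec_total          # roughly 5.5x
full_ratio_low = mem0_full_low / semvec_full     # just under 4x
full_ratio_high = mem0_full_high / semvec_full   # about 5x

# Absolute gaps in minutes: one conversation vs. the extrapolated full suite.
conv44_gap = mem0_total - semvec_total
full_gap_low = mem0_full_low - semvec_full
full_gap_high = mem0_full_high - semvec_full

print(f"ingest: {ingest_ratio:.2f}x, end-to-end: {total_ratio:.2f}x")
print(f"full suite: {full_ratio_low:.2f}x to {full_ratio_high:.2f}x")
print(f"absolute gap: {conv44_gap:.0f} min on conv-44, "
      f"{full_gap_low:.0f} to {full_gap_high:.0f} min on the full suite")
```

On these numbers the full-suite ratio stays in the same range as the single-conversation end-to-end ratio, while the absolute difference grows from about 23 minutes to several hours.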
Token efficiency: measured live
Measured on a live LOCOMO replay (mean across 10 sampled queries on
conv-44 with 100 turns seeded): Semvec's /v1/run context block is
~8.3k chars / ~2.0k tokens — well below the 20k-char ceiling because
top-K=30 reranked memory chunks rarely fill the budget.
| Setup | Context tokens / reader call |
|---|---|
| Full-context replay (avg LOCOMO conv, 544 turns) | ~16,300 |
| Full-context replay (large conv, 689 turns) | ~20,700 |
| mem0 (typical) | ~2,000–5,000 |
| Semvec (measured live) | ~2,000 |
| Savings vs full-context replay | ~8× fewer (5.4–10× by convo size, 87 % reduction) |
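The savings row can be reproduced from the other rows of the table. The sketch below only restates that arithmetic with the rounded token counts quoted above; variable names are illustrative.

```python
# Context tokens per reader call, from the table above (approximate).
full_ctx_avg = 16_300     # average LOCOMO conversation, 544 turns
full_ctx_large = 20_700   # large conversation, 689 turns
semvec_ctx = 2_000        # measured live
semvec_chars = 8_300      # ~8.3k chars for the same context block

savings_avg = full_ctx_avg / semvec_ctx        # about 8x
savings_large = full_ctx_large / semvec_ctx    # about 10x
reduction_avg = 1 - semvec_ctx / full_ctx_avg  # about 87-88 %
chars_per_token = semvec_chars / semvec_ctx    # about 4 chars per token
tokens_per_turn = full_ctx_avg / 544           # about 30 tokens per turn

print(f"savings: {savings_avg:.2f}x (avg), {savings_large:.2f}x (large)")
print(f"reduction vs avg full-context: {reduction_avg:.1%}")
print(f"chars/token: {chars_per_token:.2f}, tokens/turn: {tokens_per_turn:.1f}")
```

At roughly 30 tokens per turn, the shortest conversation (369 turns) would come to about 11k tokens of full context, which is consistent with the lower end of the quoted 5.4–10× range.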
Reproducibility
Server-side configuration (Semvec 0.6.1):

```
SEMVEC_HYBRID_BM25=1
SEMVEC_BM25_FETCH_K=100
SEMVEC_RUN_TOP_K=30
SEMVEC_CONTEXT_BUDGET_CHARS=20000
SEMVEC_RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
SEMVEC_RERANK_FETCH_K=100
```
- Embedder: paraphrase-multilingual-mpnet-base-v2 (768d)
- Fresh session per conversation, no memory carry-over
- Reader = Judge = openai/gpt-4o-mini, T = 0
- Wall-time end-to-end: ~95 min on the full 1986-QA suite
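To make the role of the top-K and context-budget settings concrete, here is a hedged sketch of how such settings might be read and applied. It is not Semvec's implementation: `RetrievalConfig`, `load_config`, and `pack_context` are hypothetical names, and the meaning assigned to each variable is inferred from its name only.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RetrievalConfig:
    hybrid_bm25: bool
    bm25_fetch_k: int
    run_top_k: int
    context_budget_chars: int
    rerank_model: str
    rerank_fetch_k: int


def load_config(env: Mapping[str, str] = os.environ) -> RetrievalConfig:
    """Read the settings listed above; raises KeyError if one is missing."""
    return RetrievalConfig(
        hybrid_bm25=env["SEMVEC_HYBRID_BM25"] == "1",
        bm25_fetch_k=int(env["SEMVEC_BM25_FETCH_K"]),
        run_top_k=int(env["SEMVEC_RUN_TOP_K"]),
        context_budget_chars=int(env["SEMVEC_CONTEXT_BUDGET_CHARS"]),
        rerank_model=env["SEMVEC_RERANK_MODEL"],
        rerank_fetch_k=int(env["SEMVEC_RERANK_FETCH_K"]),
    )


def pack_context(ranked_chunks: List[str], top_k: int, budget_chars: int) -> List[str]:
    """Assumed packing rule: keep the best top_k chunks, stop at the char budget."""
    picked, used = [], 0
    for chunk in ranked_chunks[:top_k]:
        if used + len(chunk) > budget_chars:
            break
        picked.append(chunk)
        used += len(chunk)
    return picked


cfg = load_config({
    "SEMVEC_HYBRID_BM25": "1",
    "SEMVEC_BM25_FETCH_K": "100",
    "SEMVEC_RUN_TOP_K": "30",
    "SEMVEC_CONTEXT_BUDGET_CHARS": "20000",
    "SEMVEC_RERANK_MODEL": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "SEMVEC_RERANK_FETCH_K": "100",
})

# 100 reranked candidates of ~280 chars each: top-K, not the budget, is the limit.
short_chunks = ["x" * 280] * 100
packed = pack_context(short_chunks, cfg.run_top_k, cfg.context_budget_chars)
print(len(packed), sum(len(c) for c in packed))
```

Under this assumed rule, 30 chunks of a few hundred characters each stay well under the 20,000-character budget, so the budget only becomes the binding limit when the chunks are much longer.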
```shell
pip install "semvec[benchmarks,hybrid,api]"
.venv/bin/python benchmarks/run_locomo.py --conv-id -1 --judge \
  --judge-model openai/gpt-4o-mini \
  -o benchmarks/results/locomo_FULL_$(date +%Y%m%d_%H%M%S).json
```
See also
- Running benchmarks: exact reproduce-commands and .env setup
- Parity envelope: determinism guarantees + drift bounds across replays
- vs mem0: head-to-head methodology