Semvec vs. mem0
mem0 is the closest commercial peer for "agent memory": both projects sit between an LLM and a vector store and manage long-running conversation context. Both report results on the LOCOMO benchmark (Maharana et al., 2024).
Architectural differences
| Property | Semvec | mem0 |
|---|---|---|
| Ingest LLM calls per turn | 0 — in-process deterministic update (no LLM) | LLM-driven fact extraction (one call per add() to extract atomic facts) |
| Storage form | Raw turns + cosine embedding + per-session BM25 index | Atomic facts extracted by an LLM, stored verbatim |
| Retrieval | Dense cosine + BM25 hybrid fusion, then cross-encoder rerank | Dense + sparse fusion over the fact-store |
| Token-cost behaviour | Constant per turn (no LLM ingest, no growing summary) | Linear in conversation length × extracted-fact density |
| Determinism | Deterministic update — bit-identical replay possible within a release (no LLM stochasticity at ingest) | Each add() is an LLM call, repro requires temperature=0 |
| Default LLM dependency | None for ingest — LLM only for answering | Required end-to-end |
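The retrieval row names a dense + BM25 hybrid fusion but does not say which fusion rule is used. As a minimal sketch, assuming reciprocal rank fusion (RRF), a common way to merge two ranked lists, the step could look like the following. The function name `rrf_fuse`, the constant `k=60`, and the turn ids are illustrative assumptions, not Semvec's or mem0's API, and the cross-encoder rerank that follows fusion is not shown.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by id so the output is
    deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["t3", "t1", "t7"]  # hypothetical cosine ranking, best first
bm25_hits = ["t1", "t9", "t3"]   # hypothetical BM25 ranking, best first
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Ids that rank well in both lists rise to the top, which is the point of combining a dense and a sparse signal before reranking.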
Head-to-head: LOCOMO (10 conversations, 1986 QAs)
LOCOMO is the standard suite both projects publish against. The comparison uses the same
dataset version (snap-research/locomo v1), the same reader and judge model (gpt-4o-mini,
T = 0), and the same judge prompt (byte-identical to mem0ai/mem0/evaluation/metrics/llm_judge.py).
LLM-as-Judge accuracy (Cat 1-4, mem0 headline metric)
| Category | n | Semvec | Mem0 paper |
|---|---|---|---|
| single-hop | 282 | 0.582 | 0.671 |
| multi-hop | 321 | 0.502 | 0.512 |
| temporal | 96 | 0.469 | 0.555 |
| open-domain | 841 | 0.667 | 0.729 |
| OVERALL J | 1540 | 0.605 | 0.669 |
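The OVERALL J row reads as a question-count-weighted mean of the four category scores. The short check below, using only the Semvec numbers from the table, is a sketch of that reading rather than the benchmark's own aggregation code.

```python
# (category, n, Semvec J) copied from the table above
rows = [
    ("single-hop", 282, 0.582),
    ("multi-hop", 321, 0.502),
    ("temporal", 96, 0.469),
    ("open-domain", 841, 0.667),
]

total_n = sum(n for _, n, _ in rows)
# Weight each category score by its number of questions
overall_j = sum(n * j for _, n, j in rows) / total_n
print(total_n, round(overall_j, 3))
```

Because open-domain accounts for more than half of the 1540 questions, it dominates the overall figure.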
Stemmed-F1 (all 5 cats, official LOCOMO scoring)
| Category | n | Semvec |
|---|---|---|
| single-hop | 282 | 0.366 |
| multi-hop | 321 | 0.430 |
| temporal | 96 | 0.264 |
| open-domain | 841 | 0.497 |
| adversarial | 446 | 0.352 |
| OVERALL F1 | 1986 | 0.424 |
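For readers unfamiliar with the metric, a stemmed F1 is a token-overlap F1 computed after normalizing and stemming both strings. The sketch below illustrates the idea only: `crude_stem` is a hypothetical stand-in for a real stemmer, and the tokenization is an assumption, so it will not reproduce the official LOCOMO scorer's numbers.

```python
import re
from collections import Counter

def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stemmed_f1(prediction, reference):
    """Token-level F1 between two strings after lowercasing and stemming."""
    pred = [crude_stem(t) for t in re.findall(r"[a-z0-9]+", prediction.lower())]
    ref = [crude_stem(t) for t in re.findall(r"[a-z0-9]+", reference.lower())]
    # Multiset intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = stemmed_f1("She visited Paris", "visiting Paris")
print(round(score, 3))
```

Unlike the LLM judge, this metric penalizes verbose but correct answers, since extra tokens lower precision.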
Cost asymmetry, measured live in a head-to-head run. Mem0's J-edge comes
structurally from its fact-extraction pipeline: an extra LLM pass on every
add() that condenses raw turns into atomic facts before storage. Semvec
runs zero LLM calls at ingest: every turn lands in the embedding
store via pure cosine math. Of the eight LOCOMO contenders, Semvec is the
only dedicated memory system in that cost class; the others (LangMem,
Zep, A-Mem, MemoryBank, Letta/MemGPT, plus mem0 / mem0-graph) all run
one or more generative LLM passes per stored turn.
| Stage | Semvec | mem0 (real, head-to-head) |
|---|---|---|
| Replay 675 turns (LOCOMO conv-44 ingest) | ~3 min | ~24.5 min (~8× slower) |
| QA pass — 158 questions | ~2 min | ~3.5 min |
| End-to-end | ~5 min | ~28 min (~5.5× slower) |
| LLM calls per turn at ingest | 0 | 1+ |
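The slowdown factors in the table follow directly from the measured wall-clock times. The arithmetic below is a small sketch using the rounded table values, so the ratios are approximate; it also derives a per-turn ingest latency for the 675-turn replay.

```python
# Approximate wall-clock minutes copied from the table above: (Semvec, mem0)
timings_min = {
    "ingest": (3.0, 24.5),
    "qa": (2.0, 3.5),
    "end_to_end": (5.0, 28.0),
}
ratios = {stage: mem0 / semvec for stage, (semvec, mem0) in timings_min.items()}

# Per-turn ingest latency in seconds for the 675-turn replay
turns = 675
per_turn_s = tuple(minutes * 60 / turns for minutes in timings_min["ingest"])

for stage, ratio in ratios.items():
    print(f"{stage}: mem0 / Semvec = {ratio:.1f}x")
print(f"ingest per turn: Semvec {per_turn_s[0]:.2f}s, mem0 {per_turn_s[1]:.2f}s")
```

Most of the end-to-end gap comes from ingest; the QA pass differs by less than a factor of two.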
Extrapolated to the full 1986-QA suite: Semvec finishes in ~95 minutes, mem0 would take ~6–8 hours.
Token efficiency
Measured live on a LOCOMO replay (mean across 10 queries on conv-44 with 100 turns seeded). Semvec's context block typically uses ~8.3k chars / ~2k tokens — well below its 20k-char budget ceiling because top-K=30 reranked memory chunks rarely fill it.
| Setup | Context tokens / reader call |
|---|---|
| Full-context replay (avg LOCOMO conv, 544 turns) | ~16,300 |
| Full-context replay (large conv, 689 turns) | ~20,700 |
| Mem0 (typical) | ~2,000–5,000 |
| Semvec (measured) | ~2,000 |
Both memory systems target the same context-budget problem; Semvec is leaner at ingest (no LLM round-trips), competitive at retrieval, and substantially leaner than full-context replay.
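To make the budget behaviour concrete, here is a minimal sketch of packing reranked chunks into a context block under a character ceiling. The 20k-char budget and top-K of 30 come from the text above; the function names, the synthetic chunks, and the 4-chars-per-token heuristic are assumptions for illustration, not Semvec's implementation.

```python
def pack_context(chunks, budget_chars=20_000, top_k=30):
    """Join the best-ranked chunks without exceeding a character budget.

    `chunks` is assumed to be sorted best first (for example after reranking).
    """
    selected, used = [], 0
    for chunk in chunks[:top_k]:
        cost = len(chunk) + 1  # +1 for the newline separator
        if used + cost > budget_chars:
            break
        selected.append(chunk)
        used += cost
    return "\n".join(selected)

def approx_tokens(text, chars_per_token=4):
    """Rough token estimate; 4 chars per token is a heuristic, not a tokenizer."""
    return len(text) // chars_per_token

# Synthetic chunks of roughly 280 characters each, standing in for stored turns
chunks = [f"turn {i}: " + "x" * 270 for i in range(40)]
context = pack_context(chunks)
print(len(context), approx_tokens(context))
```

With chunks of this size, 30 of them land near the ~8.3k chars / ~2k tokens reported above, well under the ceiling, so the budget check only matters for unusually long turns.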
Reproduce
Install both stacks:
Run the LOCOMO bench against Semvec with the same judge mem0 uses:
.venv/bin/python -u benchmarks/run_locomo.py --conv-id -1 --judge \
--judge-model openai/gpt-4o-mini \
-o "benchmarks/results/locomo_FULL_$(date +%Y%m%d_%H%M%S).json"
The mem0 SDK is installed via the [mem0] extra so you can wire it up
as a side-by-side baseline in your own harness if needed.
When to pick which
Pick Semvec when:
- You can't afford an LLM call on every ingest turn (cost or latency).
- You need deterministic, replayable memory state (audit, compliance).
- You want adversarial-question discipline (Cat 5 F1 = 0.352 out of the box).
- You're embedding into an existing Python stack (Rust core + thin Python API, no managed service required).
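One way to check the replayability claim in your own harness is to ingest the same turns twice and compare a digest of the resulting state. The sketch below uses a toy in-memory state as a stand-in for a deterministic store; `replay_digest` and the state layout are hypothetical and do not reflect Semvec's internal format.

```python
import hashlib
import json

def replay_digest(turns):
    """Ingest turns into a toy deterministic state and return its SHA-256 digest."""
    state = []
    for i, text in enumerate(turns):
        # Stand-in for a deterministic update: no randomness, no LLM call
        state.append({"turn": i, "text": text, "length": len(text)})
    # Canonical serialization so equal states always hash to the same bytes
    blob = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

turns = ["hi, I'm Ana", "I moved to Lisbon in May"]
first = replay_digest(turns)
second = replay_digest(list(turns))
print(first == second)
```

An ingest path that calls an LLM can fail this kind of check even at temperature 0, which is why the audit and compliance use case favours an LLM-free update.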
Pick mem0 when:
- You want a managed cloud service end-to-end.
- Fact-extraction granularity at ingest is more important than per-turn cost.
- You're already on the mem0 stack and the ingest LLM cost is in budget.
Sources
- LOCOMO (Maharana et al., 2024): https://snap-research.github.io/locomo/
- Mem0 paper: https://arxiv.org/html/2504.19413v1
- Reproducing the Semvec number: see Running benchmarks for the exact env-var config and
gpt-4o-mini reader + judge settings.