# Reproducing the 17× wall-clock vs mem0
**Drivers live in the source repo, not the wheel.** This page documents the methodology for the 17× LLM-as-Judge wall-clock vs mem0 number quoted in `llms.txt` and on the comparison pages. The driver scripts that produced it live in the semvec source repo under `benchmarks/`; they are not bundled in the PyPI wheel. To rerun, clone the repo and follow the steps below. The 17× figure is not promised on arbitrary hardware; see the reference platform section.
What "wall-clock" means here¶
End-to-end time from python -u benchmarks/run_locomo.py ... to the
final aggregated F1 + judge JSON on disk, for the full 10-conversation
LOCOMO suite (1986 QAs). Includes:
- Per-conversation replay (every prior turn ingested via
POST /v1/store) - Per-QA retrieval + answer (
POST /v1/runper question) - LLM-as-Judge re-evaluation (
benchmarks/run_locomo_judge.py)
Excluded (kept out of the 17× ratio because they would otherwise mask the comparison):

- LLM provider queueing time outside the median (both runs are timed under the same provider routing)
- `semvec serve` cold-start (the server is pre-warmed before the run)
- Embedder model download (warm cache assumed)
Both systems use the same OpenAI-compatible endpoint, the same `gpt-4o` snapshot, and the same judge model. The only delta is the memory layer.
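For orientation, here is a minimal sketch of the two endpoints the driver exercises per conversation. The JSON payload shapes are illustrative assumptions, not semvec's documented request schema:

```bash
# replay: one store call per prior turn (payload shape assumed)
curl -fsS -X POST http://127.0.0.1:8080/v1/store \
  -H 'Content-Type: application/json' \
  -d '{"text": "Speaker A: I moved to Lisbon last spring."}'

# per-QA: one run call per question (payload shape assumed)
curl -fsS -X POST http://127.0.0.1:8080/v1/run \
  -H 'Content-Type: application/json' \
  -d '{"query": "Where did speaker A move to?"}'
```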
## Reference platform
The 17× number is reproducible on the following platform. Significant deviation (e.g. CPU embedder, slower disk, cold model cache) shifts the ratio.
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 9 7950X (16C / 32T) |
| RAM | 64 GB DDR5-5600 |
| GPU | NVIDIA RTX 4090 (24 GB), driver 550.x |
| Storage | NVMe SSD (PCIe 4.0), ext4, noatime |
| OS / kernel | Ubuntu 24.04 LTS, kernel 6.6 |
| Python | 3.12.x |
| NumPy | 1.26.x |
| PyTorch | 2.3.x (CUDA 12.1 build) |
| sentence-transformers | 2.7.x (model paraphrase-multilingual-mpnet-base-v2, 768d) |
| semvec | 0.6.0 (PyPI wheel) |
| mem0 SDK | 0.1.x (latest at audit date 2026-05-13) |
| LLM provider | OpenRouter, region EU-West, model openai/gpt-4o (snapshot pinned via OpenRouter) |
| Judge model | openai/gpt-4o-mini |
| Dataset | LOCOMO public release (10 conversations, 1986 QAs); SHA-256 of the bundled locomo10.json recorded in the result JSON |
| semvec repo commit | recorded as `git rev-parse HEAD` in the result JSON header |
| mem0 commit | recorded as `pip show mem0` version + `pip freeze` snapshot in the result JSON header |
The benchmark run used semvec 0.6.0; the behaviour-unchanged 0.6.1 release (see the changelog) should reproduce the same numbers, though a re-run is still pending.
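To confirm which semvec build is actually installed before a run (using the venv from the Setup section below):

```bash
# print the installed semvec version; expect 0.6.0 or 0.6.1
.venv/bin/pip show semvec | awk '/^Version:/ {print $2}'
```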
## Setup
```bash
git clone https://github.com/<semvec-source-repo>.git
cd semvec
python -m venv .venv && source .venv/bin/activate
pip install -e ".[benchmarks,hybrid,api,mem0]"
```
Populate `.env` at the repo root per Running benchmarks, then export it into the shell.
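One way to do the export, assuming a flat `KEY=VALUE` file with no quoting edge cases, is:

```bash
# export every variable defined in .env into the current shell
set -a
source .env
set +a
```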
Pre-warm the SentenceTransformer cache (the model download is not part of the wall-clock):

```bash
.venv/bin/python -c "from sentence_transformers import SentenceTransformer; \
SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')"
```
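The rerank model used in the semvec run below is worth the same treatment; assuming it resolves through sentence-transformers, it can be pre-warmed too:

```bash
# pre-fetch the cross-encoder rerank model into the local cache
.venv/bin/python -c "from sentence_transformers import CrossEncoder; \
CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')"
```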
## Semvec run
Start `semvec serve` with the production-tuned env vars from the running guide:
```bash
SEMVEC_RUN_TOP_K=15 \
SEMVEC_CONTEXT_BUDGET_CHARS=10000 \
SEMVEC_RERANK_MODEL="cross-encoder/ms-marco-MiniLM-L-6-v2" \
SEMVEC_RERANK_FETCH_K=50 \
SEMVEC_RERANK_BATCH=64 \
SEMVEC_HYBRID_BM25=1 \
SEMVEC_BM25_FETCH_K=50 \
.venv/bin/semvec serve --host 127.0.0.1 --port 8080 --log-level info &
SERVE_PID=$!

# wait for /v1/health to come up
until curl -fsS http://127.0.0.1:8080/v1/health >/dev/null; do sleep 1; done

# clock the full suite
RUN_TS=$(date +%Y%m%d_%H%M%S)
time .venv/bin/python -u benchmarks/run_locomo.py --conv-id -1 \
  -o "benchmarks/results/semvec_FULL_${RUN_TS}.json" \
  2>&1 | tee "/tmp/semvec_${RUN_TS}.log"

# judge re-eval (same convention as the comparison ratio)
time .venv/bin/python -u benchmarks/run_locomo_judge.py \
  -i "benchmarks/results/semvec_FULL_${RUN_TS}.json" \
  -o "benchmarks/results/semvec_judge_${RUN_TS}.json" \
  --judge-model openai/gpt-4o-mini \
  2>&1 | tee "/tmp/semvec_judge_${RUN_TS}.log"

kill -TERM $SERVE_PID && wait $SERVE_PID
```
Expected wall-clock on the reference platform: ~60 minutes for the full suite (replay + retrieve + answer), ~2 minutes for the judge pass. Range observed across five runs: 55–68 minutes.
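To spot-check the scores once the judge pass finishes, a `jq` query along these lines works; the field names (`aggregate.f1`, `aggregate.judge_score`) are assumptions about the result schema, not a documented contract:

```bash
# aggregate F1 + judge score from the judge output (field names assumed)
jq '{f1: .aggregate.f1, judge: .aggregate.judge_score}' \
  "benchmarks/results/semvec_judge_${RUN_TS}.json"
```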
## mem0 run
Identical dataset, identical LLM endpoint:
```bash
RUN_TS=$(date +%Y%m%d_%H%M%S)
time .venv/bin/python -u benchmarks/run_locomo_mem0.py --conv-id -1 \
  -o "benchmarks/results/mem0_FULL_${RUN_TS}.json" \
  2>&1 | tee "/tmp/mem0_${RUN_TS}.log"

time .venv/bin/python -u benchmarks/run_locomo_judge.py \
  -i "benchmarks/results/mem0_FULL_${RUN_TS}.json" \
  -o "benchmarks/results/mem0_judge_${RUN_TS}.json" \
  --judge-model openai/gpt-4o-mini \
  2>&1 | tee "/tmp/mem0_judge_${RUN_TS}.log"
```
Expected wall-clock on the same hardware: ~1000 minutes (~16.7 h) for the full suite. mem0 issues an LLM call per ingested turn for its memory-extraction step; replay of ~5000 turns therefore dominates. Range observed: 950–1080 minutes.
Ratio at audit date: 1000 / 60 ≈ 16.7×, reported as 17× after rounding.
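The quoted figure is just the two median wall-clocks divided; recomputing it:

```bash
SEMVEC_MIN=60    # semvec full suite, minutes (reference platform)
MEM0_MIN=1000    # mem0 full suite, minutes (reference platform)
.venv/bin/python -c "print(f'{$MEM0_MIN / $SEMVEC_MIN:.1f}x')"  # 16.7x -> quoted as 17x
```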
## What counts toward the ratio
Both numbers are end-to-end wall-clock of the same `run_locomo*.py` invocation. The numerator (mem0) and denominator (semvec) include:

- Replay: ingest every prior conversation turn
- Per-QA retrieve + answer
- JSON write + judge step

They do not include `pip install`, model download, server boot, or analysis overhead.
## Caveats that affect the ratio
- **Provider routing stochasticity.** OpenRouter's `gpt-4o` routing shifts latency by ±20 % across regions. Both runs must use the same region / time window. Drift of ±2× on the ratio across providers is not unexpected; pin a single provider for any published number.
- **GPU vs CPU embedder.** On CPU-only hardware semvec's wall-clock rises 3–5× because the cross-encoder rerank dominates. mem0's wall-clock is mostly LLM-bound, so the ratio compresses from 17× to ~5× on a CPU-only box. The 17× figure is for the reference GPU platform.
- **mem0 SDK version.** mem0 has changed its memory-extraction prompt several times during 2025–2026. Always log `pip show mem0` next to the run.
- **Hybrid BM25 enabled.** With `SEMVEC_HYBRID_BM25=0` semvec drops to ~50 min (no quality impact on this metric; the BM25 path adds CPU overhead per `/v1/run`). The ratio rises to ~20×; see the variant run sketched below.
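For that BM25-off variant, restart the server with the flag flipped and rerun the same driver (assuming, as the caveat implies, the flag is read at server start); everything else matches the semvec run above:

```bash
# same tuning as the reference run, BM25 path disabled
SEMVEC_HYBRID_BM25=0 \
SEMVEC_RUN_TOP_K=15 \
SEMVEC_CONTEXT_BUDGET_CHARS=10000 \
SEMVEC_RERANK_MODEL="cross-encoder/ms-marco-MiniLM-L-6-v2" \
SEMVEC_RERANK_FETCH_K=50 \
SEMVEC_RERANK_BATCH=64 \
.venv/bin/semvec serve --host 127.0.0.1 --port 8080 --log-level info &
```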
## Recording the result
The result JSON header records:

- semvec version (`semvec.__version__`)
- mem0 version + `pip freeze` output
- Hardware (`platform.uname()`, `torch.cuda.get_device_name(0)`)
- Run start / end timestamps (UTC)
- LOCOMO dataset SHA-256
- Provider + model snapshot identifier (from the provider's response header, `openai-version` or equivalent)
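A sketch of capturing most of these fields from the shell next to a run; the dataset path `benchmarks/data/locomo10.json` and the results location are assumptions about the repo layout (the provider snapshot identifier only exists in the response headers at run time):

```bash
# capture the provenance fields alongside a run (paths are assumptions)
{
  .venv/bin/python -c "import semvec; print('semvec', semvec.__version__)"
  .venv/bin/pip show mem0 | awk '/^Version:/ {print "mem0", $2}'
  .venv/bin/python -c "import platform, torch; \
print(platform.uname()); print('gpu', torch.cuda.get_device_name(0))"
  echo "commit $(git rev-parse HEAD)"
  echo "started $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  sha256sum benchmarks/data/locomo10.json   # assumed dataset path
} | tee "benchmarks/results/env_${RUN_TS}.txt"
```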
When citing the 17× number externally, attach the result JSON so the reader can verify that the platform and provider match. The number is not a guarantee on arbitrary hardware; it is a reproducible measurement on the platform documented here.