2026-05-13">
Skip to content

Reproducing the 17× wall-clock vs mem0

Drivers live in the source repo, not the wheel

This page documents the methodology for the 17× LLM-as-Judge wall-clock vs mem0 number quoted in llms.txt and on the comparison pages. The driver scripts that produced it are in the semvec source repo under benchmarks/; they are not bundled in the PyPI wheel. To rerun, clone the repo and follow the steps below. The 17× figure is not promised on arbitrary hardware — see the reference platform section.

What "wall-clock" means here

End-to-end time from python -u benchmarks/run_locomo.py ... to the final aggregated F1 + judge JSON on disk, for the full 10-conversation LOCOMO suite (1986 QAs). Includes:

  • Per-conversation replay (every prior turn ingested via POST /v1/store)
  • Per-QA retrieval + answer (POST /v1/run per question)
  • LLM-as-Judge re-evaluation (benchmarks/run_locomo_judge.py)

Excluded (kept out of the 17× ratio because they would otherwise mask the comparison):

  • Provider-side queueing outliers (both runs are timed under the same provider routing, so transient queueing affects them equally)
  • semvec serve cold-start (the server is pre-warmed before the run)
  • Embedder model download (warm cache assumed)

Both systems use the same OpenAI-compatible endpoint, the same gpt-4o snapshot, and the same judge model. The only delta is the memory layer.
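Both drivers can construct their LLM client from the same environment, so that the memory layer really is the only delta. A minimal sketch assuming the standard openai Python SDK; the environment variable names are illustrative, not the drivers' actual contract:

import os

from openai import OpenAI

# Sketch: both benchmark drivers build their LLM client from the same
# environment, so provider, answer model, and judge model are held constant.
# The env var names here are illustrative assumptions.
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],  # e.g. the OpenRouter endpoint
    api_key=os.environ["OPENAI_API_KEY"],
)

ANSWER_MODEL = "openai/gpt-4o"      # snapshot pinned via OpenRouter
JUDGE_MODEL = "openai/gpt-4o-mini"  # same judge model for both systems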

Reference platform

The 17× number is reproducible on the following platform. Significant deviation (e.g. CPU embedder, slower disk, cold model cache) shifts the ratio.

Component               Spec
CPU                     AMD Ryzen 9 7950X (16C / 32T)
RAM                     64 GB DDR5-5600
GPU                     NVIDIA RTX 4090 (24 GB), driver 550.x
Storage                 NVMe SSD (PCIe 4.0), ext4, noatime
OS / kernel             Ubuntu 24.04 LTS, kernel 6.6
Python                  3.12.x
NumPy                   1.26.x
PyTorch                 2.3.x (CUDA 12.1 build)
sentence-transformers   2.7.x (model paraphrase-multilingual-mpnet-base-v2, 768d)
semvec                  0.6.0 (PyPI wheel)
mem0 SDK                0.1.x (latest at audit date 2026-05-13)
LLM provider            OpenRouter, region EU-West, model openai/gpt-4o (snapshot pinned via OpenRouter)
Judge model             openai/gpt-4o-mini
Dataset                 LOCOMO public release (10 conversations, 1986 QAs); SHA-256 of the bundled locomo10.json recorded in the result JSON
semvec repo commit      recorded as git rev-parse HEAD in the result JSON header
mem0 commit             recorded via pip show mem0 plus a pip freeze snapshot in the result JSON header

The benchmark run used 0.6.0; the behaviour-unchanged 0.6.1 release (see the changelog) should reproduce the same numbers, though a re-run is pending.
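Before investing hours of wall-clock, it is worth confirming the bundled dataset matches the hash recorded in a published result JSON. A standard-library sketch; the dataset path below is an assumption:

import hashlib
from pathlib import Path

# Sketch: hash the bundled LOCOMO dataset and compare against the SHA-256
# recorded in a published result JSON. The path below is an assumption.
digest = hashlib.sha256(Path("benchmarks/data/locomo10.json").read_bytes()).hexdigest()
print(f"locomo10.json sha256: {digest}")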

Setup

git clone https://github.com/<semvec-source-repo>.git
cd semvec
python -m venv .venv && source .venv/bin/activate
pip install -e ".[benchmarks,hybrid,api,mem0]"

Populate .env at the repo root per Running benchmarks, then export it:

set -a && . ./.env && set +a

Pre-warm the SentenceTransformer cache (the model download is not part of the wall-clock):

.venv/bin/python -c "from sentence_transformers import SentenceTransformer; \
  SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')"
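Optionally verify that the model now loads from the local cache onto the GPU, since a silent CPU fallback changes the wall-clock (see the caveats below). A minimal check:

import torch
from sentence_transformers import SentenceTransformer

# Sketch: confirm the embedder loads from the warm cache onto the GPU.
# A silent CPU fallback inflates semvec's wall-clock and compresses the ratio.
assert torch.cuda.is_available(), "CUDA unavailable; expect a much slower run"
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2", device="cuda")
print(model.encode(["warm-up sentence"]).shape)  # (1, 768) for this model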

Semvec run

Start semvec serve with the production-tuned env vars from the running guide:

SEMVEC_RUN_TOP_K=15 \
SEMVEC_CONTEXT_BUDGET_CHARS=10000 \
SEMVEC_RERANK_MODEL="cross-encoder/ms-marco-MiniLM-L-6-v2" \
SEMVEC_RERANK_FETCH_K=50 \
SEMVEC_RERANK_BATCH=64 \
SEMVEC_HYBRID_BM25=1 \
SEMVEC_BM25_FETCH_K=50 \
.venv/bin/semvec serve --host 127.0.0.1 --port 8080 --log-level info &
SERVE_PID=$!

# wait for /v1/health to come up
until curl -fsS http://127.0.0.1:8080/v1/health >/dev/null; do sleep 1; done

# clock the full suite
RUN_TS=$(date +%Y%m%d_%H%M%S)
time .venv/bin/python -u benchmarks/run_locomo.py --conv-id -1 \
  -o "benchmarks/results/semvec_FULL_${RUN_TS}.json" \
  2>&1 | tee "/tmp/semvec_${RUN_TS}.log"

# judge re-eval (same convention as the comparison ratio)
time .venv/bin/python -u benchmarks/run_locomo_judge.py \
  -i "benchmarks/results/semvec_FULL_${RUN_TS}.json" \
  -o "benchmarks/results/semvec_judge_${RUN_TS}.json" \
  --judge-model openai/gpt-4o-mini \
  2>&1 | tee "/tmp/semvec_judge_${RUN_TS}.log"

kill -TERM $SERVE_PID && wait $SERVE_PID

Expected wall-clock on the reference platform: ~60 minutes for the full suite (replay + retrieve + answer), ~2 minutes for the judge pass. Range observed across five runs: 55–68 minutes.

mem0 run

Identical dataset, identical LLM endpoint:

RUN_TS=$(date +%Y%m%d_%H%M%S)
time .venv/bin/python -u benchmarks/run_locomo_mem0.py --conv-id -1 \
  -o "benchmarks/results/mem0_FULL_${RUN_TS}.json" \
  2>&1 | tee "/tmp/mem0_${RUN_TS}.log"

time .venv/bin/python -u benchmarks/run_locomo_judge.py \
  -i "benchmarks/results/mem0_FULL_${RUN_TS}.json" \
  -o "benchmarks/results/mem0_judge_${RUN_TS}.json" \
  --judge-model openai/gpt-4o-mini \
  2>&1 | tee "/tmp/mem0_judge_${RUN_TS}.log"

Expected wall-clock on the same hardware: ~1000 minutes (~16.7 h) for the full suite. mem0 issues an LLM call per ingested turn for its memory-extraction step; replay of ~5000 turns therefore dominates. Range observed: 950–1080 minutes.
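A back-of-envelope check of the replay-dominance claim; the per-turn extraction latency below is an illustrative assumption, not a measured figure:

# Sketch: rough arithmetic for why replay dominates mem0's wall-clock.
# The per-turn extraction latency is an illustrative assumption.
turns = 5000                # ~5000 ingested turns across the 10 conversations
secs_per_extraction = 11.0  # assumed mean latency of one extraction LLM call
print(f"replay alone: ~{turns * secs_per_extraction / 60:.0f} min")  # ~917 of ~1000 min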

Ratio at audit date: 1000 / 60 ≈ 16.7×, rounded to the reported 17×.

What counts toward the ratio

Both numbers are end-to-end wall-clock of the same run_locomo*.py invocation. The numerator (mem0) and denominator (semvec) include:

  • Replay: ingest every prior conversation turn
  • Per-QA retrieve + answer
  • JSON write + judge step

They do not include pip install, model download, server boot, or analysis overhead.
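Given the start and end timestamps in the two result JSON headers, the ratio falls out directly. A sketch; the field names and file paths are illustrative, not the drivers' actual schema:

import json
from datetime import datetime

# Sketch: derive the published ratio from two result JSON headers.
# The "started_at" / "finished_at" field names are assumptions.
def wall_clock_minutes(path: str) -> float:
    with open(path) as f:
        header = json.load(f)
    start = datetime.fromisoformat(header["started_at"])
    end = datetime.fromisoformat(header["finished_at"])
    return (end - start).total_seconds() / 60

ratio = wall_clock_minutes("mem0_FULL.json") / wall_clock_minutes("semvec_FULL.json")
print(f"ratio: {ratio:.1f}x")  # ≈ 16.7 at audit date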

Caveats that affect the ratio

  • Provider routing stochasticity. OpenRouter's gpt-4o routing shifts latency by ±20% across regions, so both runs must use the same region and time window. Across different providers the ratio can drift by as much as 2× in either direction; pin a single provider for any published number.
  • GPU vs CPU embedder. On CPU-only hardware semvec's wall-clock rises 3–5× because the cross-encoder rerank dominates. mem0's wall-clock is mostly LLM-bound, so the ratio compresses from 17× to ~5× on a CPU-only box; the arithmetic sketch after this list shows the compression. The 17× figure is for the reference GPU platform.
  • mem0 SDK version. mem0 has changed its memory-extraction prompt several times during 2025–2026. Always log pip show mem0 next to the run.
  • Hybrid BM25 enabled. With SEMVEC_HYBRID_BM25=0 semvec drops to ~50 min (no quality impact on this metric; the BM25 path adds CPU overhead per /v1/run). The ratio rises to ~20×.
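The CPU-only compression follows directly from the rough multipliers quoted above:

# Sketch: ratio compression on CPU-only hardware, using the rough
# multipliers from the caveat above (semvec 3-5x slower; mem0 LLM-bound).
semvec_gpu_min, mem0_min = 60, 1000
semvec_cpu_min = semvec_gpu_min * 4  # midpoint of the 3-5x CPU penalty
print(f"GPU ratio: {mem0_min / semvec_gpu_min:.0f}x")  # ~17x
print(f"CPU ratio: {mem0_min / semvec_cpu_min:.1f}x")  # ~4.2x, i.e. roughly ~5x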

Recording the result

The result JSON header records:

  • semvec version (semvec.__version__)
  • mem0 version
  • pip freeze output
  • Hardware (platform.uname(), torch.cuda.get_device_name(0))
  • Run start / end timestamps (UTC)
  • LOCOMO dataset SHA-256
  • Provider + model snapshot identifier (from the provider's response header openai-version or equivalent)
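A sketch of assembling such a header; the field names are illustrative, and the drivers define the actual schema:

import hashlib
import platform
import subprocess
from datetime import datetime, timezone

import semvec
import torch

# Sketch: gather the provenance fields listed above into one header dict.
# Field names are illustrative assumptions, not the drivers' schema.
uname = platform.uname()
header = {
    "semvec_version": semvec.__version__,
    "pip_freeze": subprocess.check_output(["pip", "freeze"], text=True).splitlines(),
    "platform": {"system": uname.system, "release": uname.release, "machine": uname.machine},
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "started_at": datetime.now(timezone.utc).isoformat(),
    "dataset_sha256": hashlib.sha256(
        open("benchmarks/data/locomo10.json", "rb").read()
    ).hexdigest(),
}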

When citing the 17× number externally, attach the result JSON so the reader can verify the platform and provider match. The number is not a guarantee on arbitrary hardware — it is a reproducible measurement on the platform documented here.