Benchmarks API (semvec.benchmarks.longmemeval)

Production port of the LongMemEval harness from pss. Covers data loading, single-PSS + multi-PSS runners, LLM-as-judge scoring, and aggregate reporting.

CLI

python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --per-type 10 --n-judges 3 \
    --output results/semvec_full.json
| Flag | Default | Purpose |
| --- | --- | --- |
| --variant {S,M,oracle} | S | Dataset variant (S = ~40 sessions, M = ~500). |
| --local-file PATH | auto-download | Use a local JSON file instead of HuggingFace. |
| --max-entries N | all | Cap on total entries. |
| --skip-entries N | 0 | Skip the first N entries (resume a crashed run). |
| --per-type N | none | Take N entries per question type (balanced sampling). |
| --question-types T1 T2 ... | all | Filter by question type. |
| --output PATH | none | Write incremental JSON results. |
| --provider {openai,ollama} | openai | Generation LLM. |
| --judge-provider {openai,ollama} | openai | Judge LLM (reads JUDGE_OPENAI_* vars). |
| --temperature 0.0–2.0 | 0.0 | LLM temperature for generation AND judge. |
| --n-judges N | 1 | Judge ensemble size (majority vote). |
| --embed-model | all-MiniLM-L6-v2 | SentenceTransformer model. |
| --embed-device {cpu,cuda,mps} | cpu | Torch device. |
| --multi-pss | off | Use MultiPSSRunner (3-way PSS: user / assistant / QA). |
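
A crashed run can be resumed by skipping the entries already written to the incremental output file (the skip count below is illustrative; keep the other flags identical to the original run):

python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --skip-entries 120 \
    --output results/semvec_full.json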

Output JSON

{
  "summary": {
    "total_entries": 30,
    "pss_accuracy": 0.13,
    "baseline_accuracy": 0.07,
    "total_pss_tokens": 12424,
    "total_baseline_tokens": 3665963,
    "savings_pct": 99.7,
    "per_type": { "...": "..." },
    "total_memory_system_calls": 30,
    "total_memory_system_tokens": 12424,
    "total_baseline_calls": 30,
    "total_baseline_full_tokens": 3665963,
    "total_wall_clock_seconds": 537.3,
    "avg_wall_clock_seconds_per_entry": 17.9
  },
  "results": [ {per-entry } ]
}
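
The report is plain JSON, so it can be inspected with nothing more than the standard library (field names as shown above, path as passed to --output):

import json

with open("results/semvec_full.json") as f:
    report = json.load(f)

print(report["summary"]["pss_accuracy"], report["summary"]["savings_pct"])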

Why is pss_accuracy 0.13 here but 40.8% in the Parity envelope?

The JSON snippet above comes from a balanced 30-entry smoke run (--per-type 10 picks 10 entries from each of three question types, a small and deliberately hard sub-sample). The 40.8% figure on the parity page is the full 500-entry LongMemEval-S run. Both results are real and come from the same harness; they differ in sample size, not in correctness.
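
For reference, a full-S run simply drops --per-type so every entry is evaluated (the remaining flags here are illustrative, not necessarily the exact parity configuration):

python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --output results/semvec_full_s.json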

LongMemEvalRunner

Single-PSS runner. One SemvecState per entry.

from semvec.benchmarks.longmemeval import LongMemEvalRunner

runner = LongMemEvalRunner(
    dataset=ds,
    llm=llm,
    embedder=embedder,
    pss_config_dimension=384,
    serializer_config=None,
    system_prompt="You are a helpful assistant with memory of past conversations.",
)
result = runner.run_entry(ds.entries[0])         # EntryResult
results = runner.run_all()                       # list[EntryResult]

MultiPSSRunner

Three-way PSS: one for user turns, one for assistant turns, one for combined Q&A clusters. Temporal prefixes [YYYY-MM-DD] are prepended to every chunk.

from semvec.benchmarks.longmemeval import MultiPSSRunner

runner = MultiPSSRunner(
    dataset=ds,
    llm=llm,
    embedder=embedder,
    pss_config_dimension=384,
    top_k_per_instance=5,
    max_memory_chars=500,
)
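
The temporal prefix is just a date tag on each chunk; a minimal sketch of the idea (illustrative only, not the runner's internal code):

def with_temporal_prefix(session_date: str, chunk: str) -> str:
    # "2023-05-20", "Booked flights to Lisbon" -> "[2023-05-20] Booked flights to Lisbon"
    return f"[{session_date}] {chunk}" if session_date else chunk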

EntryResult

@dataclass
class EntryResult:
    question_id: str
    question_type: str
    question: str
    ground_truth: str
    pss_response: str
    baseline_response: str
    pss_query_tokens: int
    baseline_query_tokens: int
    sessions_ingested: int
    pss_total_calls: int
    baseline_total_calls: int
    pss_total_tokens: int
    baseline_total_tokens: int
    wall_clock_seconds: float
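
As an illustration of how these counters combine, a hypothetical helper (not part of the package) that mirrors the summary-level savings_pct for a single entry:

def token_savings_pct(r: EntryResult) -> float:
    # 100 * (1 - pss / baseline); guard against a zero-token baseline.
    if r.baseline_total_tokens == 0:
        return 0.0
    return 100.0 * (1.0 - r.pss_total_tokens / r.baseline_total_tokens)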

GroundTruthEvaluator

LLM-as-judge with A/B comparison.

from semvec.benchmarks.longmemeval import GroundTruthEvaluator

evaluator = GroundTruthEvaluator(judge_llm=judge, n_judges=3)
eval_result = evaluator.evaluate_entry(entry_result)       # EvalResult
eval_results = evaluator.evaluate_all(runner.run_all())

Plus a single-response PASS/FAIL convenience:

verdict = evaluator.evaluate(
    question="…",
    correct_answer="…",
    candidate_response="…",
)
# SimpleEvalVerdict(passed=True|False, explanation="…")

With n_judges > 1, each correctness flag uses majority vote; ties resolve to False (conservative).
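
The rule amounts to a strict-majority check; a minimal sketch (illustrative, not the evaluator's internal code):

def majority_passes(votes: list[bool]) -> bool:
    # Strict majority required: a 1-1 tie with two judges resolves to False.
    return sum(votes) > len(votes) / 2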

BenchmarkSummary

from semvec.benchmarks.longmemeval import BenchmarkSummary

summary = BenchmarkSummary.from_results(eval_results)
summary.to_dict()       # JSON-safe
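
Because to_dict() returns JSON-safe values, persisting the summary takes only the standard library (the path is illustrative):

import json

with open("results/summary.json", "w") as f:
    json.dump(summary.to_dict(), f, indent=2)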

Data loading

  • load_from_file(path, variant="S") -> LongMemEvalDataset — parse a local JSON file.
  • load_dataset(variant="S") -> LongMemEvalDataset — download from HuggingFace if not cached locally, then parse.
  • download_dataset(variant) -> Path — just the HuggingFace download step.
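
A minimal loading sketch using the functions above (the local path is illustrative):

from semvec.benchmarks.longmemeval import load_dataset, load_from_file

ds = load_dataset(variant="S")                               # download if not cached, then parse
ds = load_from_file("data/longmemeval_s.json", variant="S")  # or parse a local copy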

Schema

from semvec.benchmarks.longmemeval import (
    LongMemEvalDataset,
    LongMemEvalEntry,
    Session,
)
  • Session(session_id, turns, has_answer=False, session_date="") plus approx_tokens, as_text().
  • LongMemEvalEntry — 7 fields; total_sessions, evidence_sessions, approx_total_tokens.
  • LongMemEvalDataset(variant, entries) plus question_types, filter(question_types, max_entries, per_type).
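
For example, the balanced smoke-run subset maps onto filter (assuming keyword arguments match the signature above):

subset = ds.filter(per_type=10)   # 10 entries per question type, like --per-type 10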

LLMCallCounter

Proxy that tracks calls and token usage for any callable LLM client.

from semvec.benchmarks.longmemeval import LLMCallCounter

counter = LLMCallCounter(base_llm)
counter(messages)               # call
counter.total_calls             # int
counter.total_prompt_tokens     # int
counter.total_completion_tokens # int
with counter.scope() as s:
    counter(messages_a)
    counter(messages_b)
    print(s.calls, s.prompt_tokens)

What is NOT ported

Mem0Runner requires the external mem0ai package and was deliberately dropped from the wheel. If you need a Mem0 baseline, install mem0ai + faiss-cpu separately and implement a runner against the same EntryResult shape — the GroundTruthEvaluator will accept it unchanged.
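
A rough skeleton of such a runner, assuming only that it produces EntryResult objects (constructor arguments and internals here are hypothetical):

class Mem0Runner:
    def __init__(self, dataset, llm):
        self.dataset = dataset
        self.llm = llm

    def run_entry(self, entry) -> EntryResult:
        # Ingest the entry's sessions into mem0, answer the question,
        # and fill every EntryResult field listed above.
        ...

    def run_all(self) -> list[EntryResult]:
        return [self.run_entry(e) for e in self.dataset.entries]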