Benchmarks API (semvec.benchmarks.longmemeval)

Production port of the LongMemEval harness from pss. Covers data loading, single-PSS + multi-PSS runners, LLM-as-judge scoring, and aggregate reporting.

CLI

python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --per-type 10 --n-judges 3 \
    --output results/semvec_full.json
| Flag | Default | Purpose |
| --- | --- | --- |
| --variant {S,M,oracle} | S | Dataset variant (S = ~40 sessions, M = ~500). |
| --local-file PATH | auto-download | Use a local JSON file instead of HuggingFace. |
| --max-entries N | all | Cap on total entries. |
| --skip-entries N | 0 | Skip the first N entries (resume a crashed run). |
| --per-type N | none | Take N entries per question type (balanced sampling). |
| --question-types T1 T2 ... | all | Filter by question type. |
| --output PATH | none | Write incremental JSON results. |
| --provider {openai,ollama} | openai | Generation LLM. |
| --judge-provider {openai,ollama} | openai | Judge LLM (reads JUDGE_OPENAI_* vars). |
| --temperature 0.0–2.0 | 0.0 | LLM temperature for generation AND judge. |
| --n-judges N | 1 | Judge ensemble size (majority vote). |
| --embed-model | all-MiniLM-L6-v2 | SentenceTransformer model. |
| --embed-device {cpu,cuda,mps} | cpu | Torch device. |
| --multi-pss | off | Use MultiPSSRunner (3-way PSS: user / assistant / QA). |
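
A crashed run can be resumed by skipping the entries already written to the incremental output file (the skip count below is illustrative; keep the other flags identical to the original run):

python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --skip-entries 120 \
    --output results/semvec_full.json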

Output JSON

{
  "summary": {
    "total_entries": 30,
    "pss_accuracy": 0.13,
    "baseline_accuracy": 0.07,
    "total_pss_tokens": 12424,
    "total_baseline_tokens": 3665963,
    "savings_pct": 99.7,
    "per_type": { "...": "..." },
    "total_memory_system_calls": 30,
    "total_memory_system_tokens": 12424,
    "total_baseline_calls": 30,
    "total_baseline_full_tokens": 3665963,
    "total_wall_clock_seconds": 537.3,
    "avg_wall_clock_seconds_per_entry": 17.9
  },
  "results": [ {per-entry } ]
}
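
The report is plain JSON, so it can be inspected with nothing more than the standard library (field names as shown above, path as passed to --output):

import json

with open("results/semvec_full.json") as f:
    report = json.load(f)

print(report["summary"]["pss_accuracy"], report["summary"]["savings_pct"])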

Why is pss_accuracy 0.13 here but 40.8% in the Parity envelope?

The JSON snippet above comes from a balanced 30-entry smoke run (--per-type 10 picks 10 entries from each of three question types, a small and deliberately hard sub-sample). The 40.8% figure on the parity page is the full 500-entry LongMemEval-S run. Both results are real and come from the same harness; they differ in sample size, not in correctness.
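
For reference, a full-S run simply drops --per-type so every entry is evaluated (the remaining flags here are illustrative, not necessarily the exact parity configuration):

python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --output results/semvec_full_s.json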

LongMemEvalRunner

Single-PSS runner. One SemvecState per entry.

from semvec.benchmarks.longmemeval import LongMemEvalRunner

runner = LongMemEvalRunner(
    dataset=ds,
    llm=llm,
    embedder=embedder,
    pss_config_dimension=384,
    serializer_config=None,
    system_prompt="You are a helpful assistant with memory of past conversations.",
)
result = runner.run_entry(ds.entries[0])         # EntryResult
results = runner.run_all()                       # list[EntryResult]

MultiPSSRunner

Three-way PSS: one for user turns, one for assistant turns, one for combined Q&A clusters. Temporal prefixes [YYYY-MM-DD] are prepended to every chunk.

from semvec.benchmarks.longmemeval import MultiPSSRunner

runner = MultiPSSRunner(
    dataset=ds,
    llm=llm,
    embedder=embedder,
    pss_config_dimension=384,
    top_k_per_instance=5,
    max_memory_chars=500,
)
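
The temporal prefix is just a date tag on each chunk; a minimal sketch of the idea (illustrative only, not the runner's internal code):

def with_temporal_prefix(session_date: str, chunk: str) -> str:
    # "2023-05-20", "Booked flights to Lisbon" -> "[2023-05-20] Booked flights to Lisbon"
    return f"[{session_date}] {chunk}" if session_date else chunk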

EntryResult

@dataclass
class EntryResult:
    question_id: str
    question_type: str
    question: str
    ground_truth: str
    pss_response: str
    baseline_response: str
    pss_query_tokens: int
    baseline_query_tokens: int
    sessions_ingested: int
    pss_total_calls: int
    baseline_total_calls: int
    pss_total_tokens: int
    baseline_total_tokens: int
    wall_clock_seconds: float
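
As an illustration of how these counters combine, a hypothetical helper (not part of the package) that mirrors the summary-level savings_pct for a single entry:

def token_savings_pct(r: EntryResult) -> float:
    # 100 * (1 - pss / baseline); guard against a zero-token baseline.
    if r.baseline_total_tokens == 0:
        return 0.0
    return 100.0 * (1.0 - r.pss_total_tokens / r.baseline_total_tokens)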

GroundTruthEvaluator

LLM-as-judge with A/B comparison.

from semvec.benchmarks.longmemeval import GroundTruthEvaluator

evaluator = GroundTruthEvaluator(judge_llm=judge, n_judges=3)
eval_result = evaluator.evaluate_entry(entry_result)       # EvalResult
eval_results = evaluator.evaluate_all(runner.run_all())

Plus a single-response PASS/FAIL convenience:

verdict = evaluator.evaluate(
    question="…",
    correct_answer="…",
    candidate_response="…",
)
# SimpleEvalVerdict(passed=True|False, explanation="…")

With n_judges > 1, each correctness flag uses majority vote; ties resolve to False (conservative).
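
The rule amounts to a strict-majority check; a minimal sketch (illustrative, not the evaluator's internal code):

def majority_passes(votes: list[bool]) -> bool:
    # Strict majority required: a 1-1 tie with two judges resolves to False.
    return sum(votes) > len(votes) / 2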

BenchmarkSummary

from semvec.benchmarks.longmemeval import BenchmarkSummary

summary = BenchmarkSummary.from_results(eval_results)
summary.to_dict()       # JSON-safe
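
Because to_dict() returns JSON-safe values, persisting the summary takes only the standard library (the path is illustrative):

import json

with open("results/summary.json", "w") as f:
    json.dump(summary.to_dict(), f, indent=2)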

Data loading

  • load_from_file(path, variant="S") -> LongMemEvalDataset — parse a local JSON file.
  • load_dataset(variant="S") -> LongMemEvalDataset — download from HuggingFace if not cached locally, then parse.
  • download_dataset(variant) -> Path — just the HuggingFace download step.
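
A minimal loading sketch using the functions above (the local path is illustrative):

from semvec.benchmarks.longmemeval import load_dataset, load_from_file

ds = load_dataset(variant="S")                               # download if not cached, then parse
ds = load_from_file("data/longmemeval_s.json", variant="S")  # or parse a local copy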

Schema

from semvec.benchmarks.longmemeval import (
    LongMemEvalDataset,
    LongMemEvalEntry,
    Session,
)
  • Session(session_id, turns, has_answer=False, session_date="") plus approx_tokens, as_text().
  • LongMemEvalEntry — 7 fields; total_sessions, evidence_sessions, approx_total_tokens.
  • LongMemEvalDataset(variant, entries) plus question_types, filter(question_types, max_entries, per_type).
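
For example, the balanced smoke-run subset maps onto filter (assuming keyword arguments match the signature above):

subset = ds.filter(per_type=10)   # 10 entries per question type, like --per-type 10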

LLMCallCounter

Proxy that tracks calls and token usage for any callable LLM client.

from semvec.benchmarks.longmemeval import LLMCallCounter

counter = LLMCallCounter(base_llm)
counter(messages)               # call
counter.total_calls             # int
counter.total_prompt_tokens     # int
counter.total_completion_tokens # int
with counter.scope() as s:
    counter(messages_a)
    counter(messages_b)
    print(s.calls, s.prompt_tokens)

What is NOT ported

Mem0Runner requires the external mem0ai package and was deliberately dropped from the wheel. If you need a Mem0 baseline, install mem0ai + faiss-cpu separately and implement a runner against the same EntryResult shape — the GroundTruthEvaluator will accept it unchanged.
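
A rough skeleton of such a runner, assuming only that it produces EntryResult objects (constructor arguments and internals here are hypothetical):

class Mem0Runner:
    def __init__(self, dataset, llm):
        self.dataset = dataset
        self.llm = llm

    def run_entry(self, entry) -> EntryResult:
        # Ingest the entry's sessions into mem0, answer the question,
        # and fill every EntryResult field listed above.
        ...

    def run_all(self) -> list[EntryResult]:
        return [self.run_entry(e) for e in self.dataset.entries]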