Benchmarks API (semvec.benchmarks.longmemeval)¶
Production port of the LongMemEval harness from pss. Covers data loading, single-PSS + multi-PSS runners, LLM-as-judge scoring, and aggregate reporting.
CLI¶
python -m semvec.benchmarks.longmemeval \
    --variant S --multi-pss --temperature 0.0 \
    --embed-device cuda \
    --per-type 10 --n-judges 3 \
    --output results/semvec_full.json
| Flag | Default | Purpose |
|---|---|---|
| --variant {S,M,oracle} | S | Dataset variant (S = ~40 sessions, M = ~500). |
| --local-file PATH | auto-download | Use a local JSON file instead of HuggingFace. |
| --max-entries N | all | Cap on total entries. |
| --skip-entries N | 0 | Skip first N (resume a crashed run). |
| --per-type N | none | Take N per question type (balanced sampling). |
| --question-types T1 T2 ... | all | Filter by type. |
| --output PATH | none | Incremental JSON results. |
| --provider {openai,ollama} | openai | Generation LLM. |
| --judge-provider {openai,ollama} | openai | Judge LLM (reads JUDGE_OPENAI_* vars). |
| --temperature 0.0–2.0 | 0.0 | LLM temperature for generation AND judge. |
| --n-judges N | 1 | Judge ensemble size (majority vote). |
| --embed-model | all-MiniLM-L6-v2 | SentenceTransformer model. |
| --embed-device {cpu,cuda,mps} | cpu | Torch device. |
| --multi-pss | off | Use MultiPSSRunner (3-way PSS: user / assistant / QA). |
Output JSON¶
{
"summary": {
"total_entries": 30,
"pss_accuracy": 0.13,
"baseline_accuracy": 0.07,
"total_pss_tokens": 12424,
"total_baseline_tokens": 3665963,
"savings_pct": 99.7,
"per_type": { "...": "..." },
"total_memory_system_calls": 30,
"total_memory_system_tokens": 12424,
"total_baseline_calls": 30,
"total_baseline_full_tokens": 3665963,
"total_wall_clock_seconds": 537.3,
"avg_wall_clock_seconds_per_entry": 17.9
},
"results": [ {per-entry …} ]
}
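Because --output writes this same structure incrementally, the quickest sanity check on a run is to load the file and print the summary block. A minimal sketch, assuming the results/semvec_full.json path from the CLI example above:
import json

# Load the results file written by --output and report the headline numbers.
with open("results/semvec_full.json") as f:
    report = json.load(f)

s = report["summary"]
print(f"entries:           {s['total_entries']}")
print(f"PSS accuracy:      {s['pss_accuracy']:.1%}")
print(f"baseline accuracy: {s['baseline_accuracy']:.1%}")
print(f"token savings:     {s['savings_pct']}%")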
Why pss_accuracy: 0.13 here but 40.8% in the Parity envelope?
The JSON snippet above is from a balanced 30-entry smoke run: --per-type 10 picks 10 entries each from three question types, a small and deliberately hard sub-sample. The 40.8% figure on the parity page comes from the full 500-entry LongMemEval-S run. Both results are real and come from the same harness; they measure different sample sizes, not different correctness.
LongMemEvalRunner¶
Single-PSS runner. One SemvecState per entry.
from semvec.benchmarks.longmemeval import LongMemEvalRunner
runner = LongMemEvalRunner(
    dataset=ds,
    llm=llm,
    embedder=embedder,
    pss_config_dimension=384,
    serializer_config=None,
    system_prompt="You are a helpful assistant with memory of past conversations.",
)
result = runner.run_entry(ds.entries[0])   # EntryResult
results = runner.run_all()                 # list[EntryResult]
MultiPSSRunner¶
Three-way PSS: one for user turns, one for assistant turns, one for combined Q&A clusters. Temporal prefixes [YYYY-MM-DD] are prepended to every chunk.
from semvec.benchmarks.longmemeval import MultiPSSRunner
runner = MultiPSSRunner(
    dataset=ds,
    llm=llm,
    embedder=embedder,
    pss_config_dimension=384,
    top_k_per_instance=5,
    max_memory_chars=500,
)
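Only the constructor is shown here; running it is assumed to mirror LongMemEvalRunner (an assumption, since the shared run_entry / run_all interface is not spelled out above):
result = runner.run_entry(ds.entries[0])   # EntryResult, same shape as the single-PSS runner
results = runner.run_all()                 # list[EntryResult]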
EntryResult¶
@dataclass
class EntryResult:
    question_id: str
    question_type: str
    question: str
    ground_truth: str
    pss_response: str
    baseline_response: str
    pss_query_tokens: int
    baseline_query_tokens: int
    sessions_ingested: int
    pss_total_calls: int
    baseline_total_calls: int
    pss_total_tokens: int
    baseline_total_tokens: int
    wall_clock_seconds: float
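The per-entry token counters are enough to derive a savings figure analogous to savings_pct in the summary. A small sketch using only the fields above (the exact aggregation the harness uses is not shown here):
def entry_savings_pct(r: EntryResult) -> float:
    """Percent of baseline tokens avoided by the memory-backed path for one entry."""
    if r.baseline_total_tokens == 0:
        return 0.0
    return 100.0 * (1.0 - r.pss_total_tokens / r.baseline_total_tokens)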
GroundTruthEvaluator¶
LLM-as-judge with A/B comparison.
from semvec.benchmarks.longmemeval import GroundTruthEvaluator
evaluator = GroundTruthEvaluator(judge_llm=judge, n_judges=3)
eval_result = evaluator.evaluate_entry(entry_result) # EvalResult
eval_results = evaluator.evaluate_all(runner.run_all())
There is also a single-response PASS/FAIL convenience method:
verdict = evaluator.evaluate(
    question="…",
    correct_answer="…",
    candidate_response="…",
)
# SimpleEvalVerdict(passed=True|False, explanation="…")
With n_judges > 1, each correctness flag uses majority vote; ties resolve to False (conservative).
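The majority-vote rule amounts to requiring a strict majority of PASS verdicts. The sketch below illustrates that rule; it is not the harness's internal code:
def majority_vote(verdicts: list[bool]) -> bool:
    # Strict majority required: a tie (e.g. 1 PASS of 2 judges) resolves to False.
    return sum(verdicts) > len(verdicts) / 2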
BenchmarkSummary¶
from semvec.benchmarks.longmemeval import BenchmarkSummary
summary = BenchmarkSummary.from_results(eval_results)
summary.to_dict() # JSON-safe
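Putting the pieces together, a typical offline flow is: run every entry, judge the results, aggregate, and write the JSON-safe summary to disk. A sketch assuming runner and judge have already been constructed as shown above (the results.json path is illustrative):
import json

from semvec.benchmarks.longmemeval import BenchmarkSummary, GroundTruthEvaluator

evaluator = GroundTruthEvaluator(judge_llm=judge, n_judges=3)

eval_results = evaluator.evaluate_all(runner.run_all())
summary = BenchmarkSummary.from_results(eval_results)

with open("results.json", "w") as f:
    json.dump(summary.to_dict(), f, indent=2)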
Data loading¶
- load_from_file(path, variant="S") -> LongMemEvalDataset — parse a local JSON file.
- load_dataset(variant="S") -> LongMemEvalDataset — download from HuggingFace if not cached locally, then parse.
- download_dataset(variant) -> Path — just the HuggingFace download step.
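A loading sketch using load_dataset above and the dataset's filter method listed under Schema below:
from semvec.benchmarks.longmemeval import load_dataset

# Download LongMemEval-S from HuggingFace if it is not cached locally, then parse it.
ds = load_dataset(variant="S")

# Balanced sub-sample: 10 entries per question type, mirroring --per-type 10.
subset = ds.filter(per_type=10)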
Schema¶
- Session(session_id, turns, has_answer=False, session_date="") — approx_tokens, as_text().
- LongMemEvalEntry — 7 fields; total_sessions, evidence_sessions, approx_total_tokens.
- LongMemEvalDataset(variant, entries) — question_types, filter(question_types, max_entries, per_type).
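A short access sketch, reusing ds from the loading example above and touching only the attributes listed here (whether evidence_sessions is a count or a list is not specified, so it is simply printed):
entry = ds.entries[0]
print(entry.total_sessions, entry.evidence_sessions, entry.approx_total_tokens)
print(ds.question_types)   # question types present in the dataset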
LLMCallCounter¶
Proxy that tracks calls and token usage for any callable LLM client.
from semvec.benchmarks.longmemeval import LLMCallCounter
counter = LLMCallCounter(base_llm)
counter(messages) # call
counter.total_calls # int
counter.total_prompt_tokens # int
counter.total_completion_tokens # int
with counter.scope() as s:
    counter(messages_a)
    counter(messages_b)
    print(s.calls, s.prompt_tokens)
What is NOT ported¶
Mem0Runner requires the external mem0ai package and was deliberately dropped from the wheel. If you need a Mem0 baseline, install mem0ai + faiss-cpu separately and implement a runner against the same EntryResult shape — the GroundTruthEvaluator will accept it unchanged.
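If you do wire up such a baseline, the only contract is the EntryResult shape documented above. Below is a skeleton of what a replacement runner might look like; the class name, constructor, and every value marked with ... are illustrative placeholders, not part of semvec:
from semvec.benchmarks.longmemeval import EntryResult, LongMemEvalDataset

class MyMem0Runner:
    """Hypothetical baseline runner; only the EntryResult contract matters."""

    def __init__(self, dataset: LongMemEvalDataset, llm):
        self.dataset = dataset
        self.llm = llm

    def run_entry(self, entry) -> EntryResult:
        # Ingest the entry's sessions into your memory system, answer the
        # question from memory and from the full-context baseline, and count
        # calls/tokens along the way. All values below are placeholders.
        return EntryResult(
            question_id=...,
            question_type=...,
            question=...,
            ground_truth=...,
            pss_response=...,            # your memory system's answer goes here
            baseline_response=...,
            pss_query_tokens=0,
            baseline_query_tokens=0,
            sessions_ingested=0,
            pss_total_calls=0,
            baseline_total_calls=0,
            pss_total_tokens=0,
            baseline_total_tokens=0,
            wall_clock_seconds=0.0,
        )

    def run_all(self) -> list[EntryResult]:
        return [self.run_entry(e) for e in self.dataset.entries]
Pass the resulting list straight to GroundTruthEvaluator.evaluate_all, exactly as with the built-in runners.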