
Detecting updates vs. new information (DedupSignal)

"How do I know whether this incoming fact is something we already have — without burning an LLM call on every turn?"

When you put Semvec in front of an existing RAG / agent / ingest pipeline, you usually need a fast, deterministic yes/no on:

  • Has the user already told us this?
  • Is this a refinement of something we stored?
  • Or is it genuinely new information that should flow into the downstream RAG / vector store / knowledge base?

The dedup_signal is the read-only hint Semvec returns on every SemvecState.update() call to answer that. The caller decides what to do with the answer.

What it is — and what it isn't

| It is | It isn't |
| --- | --- |
| Read-only, informational, attached to every update result. | An action: Semvec does not skip, merge, replace, or tombstone anything. |
| A boolean (is_update) derived from max_sim > threshold, with the raw similarity exposed. | A contradiction detector: "Headquarters: Munich" and "Headquarters: Berlin" have similar embeddings, so both register as updates. |
| Cheap: re-uses similarities Semvec already computes in the update path, so zero extra cosine work. | An LLM judge: no model call, no extra latency, deterministic. |
| Threshold-tunable globally and per-call. | A one-size-fits-all rule: the right threshold depends on your embedder and content. |

Storage stays append-only. The signal tells you what Semvec saw; what you do about it is your call.

The 30-second example

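from semvec import SemvecConfig, SemvecState

# `embedder` is whatever embedding client you already use; its vectors
# must have the same length as `dimension` below.
state = SemvecState(SemvecConfig(dimension=768))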

# Turn 1 — first time this user mentions deployment region.
emb = embedder.get_embedding("We deploy our service in eu-west-1.")
res = state.update(emb, "We deploy our service in eu-west-1.")
print(res["dedup_signal"])
# {'is_update': False, 'max_sim': 0.0, 'matched_id': None}

# Turn 7 — same user, paraphrased.
emb = embedder.get_embedding("Our deployment region is eu-west-1.")
res = state.update(emb, "Our deployment region is eu-west-1.")
print(res["dedup_signal"])
# {'is_update': True, 'max_sim': 0.91, 'matched_id': '019e3f65-dd29-7d90-...'}

What you do with is_update=True is out of Semvec's scope — it's exactly the kind of decision that belongs to your application:

text = "Our deployment region is eu-west-1."
sig = res["dedup_signal"]
if sig["is_update"]:
    # Already known: don't re-index it into the downstream
    # vector store / RAG.
    pass
else:
    # Genuinely new: push to the canonical knowledge base.
    # (`canonical_kb` stands in for your own store.)
    canonical_kb.upsert(text, embedding=emb)

Why this exists — the RAG-frontend use case

A common architecture pattern is:

user message
┌──────────────┐                ┌───────────────┐
│  Semvec      │── retrieve ───▶│  LLM /        │
│ (session     │                │  agent        │
│  memory,     │── context  ───▶│               │
│  per-user)   │                └───────────────┘
└──────────────┘                       │
   ▲                                   │ generates / fetches
   │                                   ▼
   └─── new fact?  ─────────  RAG / knowledge base
                              (shared, slower, expensive ingest)

The expensive path is the downstream RAG ingest — chunking, embedding, re-indexing, often a paid generative reformulation step. You want to trigger it only for genuinely new information, not for "the user said something we already learned three turns ago."

Without a signal, every chatbot turn either pays the ingest cost unconditionally (wasteful) or relies on its own ad-hoc dedup heuristic (brittle and prone to drifting out of sync with the embedder). The dedup_signal gives you a deterministic gate keyed off the similarity Semvec had to compute anyway.

Fields

{
    "is_update": True,                                          # bool
    "max_sim": 0.91,                                            # float in [-1, 1]
    "matched_id": "019e3f65-dd29-7d90-b501-c000bfe6c0df",       # str | None
}
| Field | Type | Meaning |
| --- | --- | --- |
| is_update | bool | max_sim > threshold; likely a duplicate or update of an existing memory. |
| max_sim | float ∈ [-1, 1] | Cosine similarity to the closest prior memory. The just-inserted memory is excluded from the comparison so you don't get spurious 1.0 self-matches. |
| matched_id | str or None | UUIDv7 of the memory that produced max_sim. None on cold start, when the isolation filter rejected the input, or when the NaN guard tripped. |

The matched_id is the stable identifier of the memory — it survives to_dict() / from_dict() round-trips and remains valid across snapshot reloads. Use it to correlate the signal with what Semvec returned from get_relevant_memories().
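To make the field semantics concrete, here is a minimal pure-Python sketch of how a signal with this shape could be derived from a set of prior embeddings. It is not Semvec's implementation: the function names and the `prior` dict layout (memory id mapped to embedding) are assumptions made for illustration, and the isolation filter and NaN guard are left out.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sketch_dedup_signal(
    new_emb: Sequence[float],
    prior: Dict[str, Sequence[float]],  # hypothetical layout: id -> embedding
    threshold: float = 0.85,
) -> dict:
    """Compare new_emb against prior memories only (never against itself)."""
    best_id: Optional[str] = None
    best_sim = 0.0
    for mem_id, emb in prior.items():
        sim = cosine(new_emb, emb)
        if best_id is None or sim > best_sim:
            best_id, best_sim = mem_id, sim
    return {
        # Strict comparison, matching "max_sim > threshold".
        "is_update": best_id is not None and best_sim > threshold,
        "max_sim": best_sim,
        "matched_id": best_id,  # None when there is nothing to compare to
    }
```

With an empty `prior`, the sketch returns the same placeholder shape as a cold start: is_update False, max_sim 0.0, matched_id None.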

Threshold tuning

The default is SemvecConfig.dedup_update_threshold = 0.85.

We ship a deliberately conservative default; the right number is workload-dependent. Three knobs:

# 1. Global default at construction time.
cfg = SemvecConfig(dimension=768, dedup_update_threshold=0.9)
state = SemvecState(cfg)

# 2. Per-call override (keyword-only). Takes precedence over the
#    config default for this one call.
res = state.update(emb, text, dedup_threshold=0.95)

# 3. Read the raw similarity and decide outside Semvec — `max_sim`
#    is always exposed, so a downstream component can apply its own
#    policy without re-running anything.
threshold_for_kb_ingest = 0.93
if res["dedup_signal"]["max_sim"] < threshold_for_kb_ingest:
    canonical_kb.upsert(text, embedding=emb)

The empirical way to pick a threshold:

  1. Collect 100–200 pairs from your own production traffic.
  2. Label each as duplicate / update / new.
  3. Sweep the threshold from 0.80 to 0.95 in 0.01 steps.
  4. Pick the threshold that maximises F1 on the duplicate class (or recall, if false-negatives are more expensive for your case).
  5. Re-tune when you change the embedding model — thresholds are model-specific. mpnet-768 and OpenAI text-embedding-3-large produce different similarity distributions for the same content.
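The sweep in steps 3 and 4 can be sketched in a few lines. This assumes you have already reduced each labelled pair to a `(max_sim, should_flag)` tuple, where `should_flag` says whether you want the pair to register as is_update; how you map the three labels onto that boolean is your call. The function names are illustrative, not part of Semvec.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, bool]  # (max_sim, should register as is_update)


def f1_at(threshold: float, samples: Sequence[Sample]) -> float:
    """F1 of the rule `max_sim > threshold` on the positive class."""
    tp = fp = fn = 0
    for max_sim, should_flag in samples:
        predicted = max_sim > threshold
        if predicted and should_flag:
            tp += 1
        elif predicted and not should_flag:
            fp += 1
        elif not predicted and should_flag:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep(
    samples: Sequence[Sample], lo: float = 0.80, hi: float = 0.95, step: float = 0.01
) -> List[Tuple[float, float]]:
    """Return (threshold, F1) for each step from lo to hi inclusive."""
    n_steps = int(round((hi - lo) / step))
    # Round each threshold to 2 decimals to avoid float accumulation error.
    return [
        (round(lo + i * step, 2), f1_at(round(lo + i * step, 2), samples))
        for i in range(n_steps + 1)
    ]


def best_threshold(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Lowest threshold that reaches the maximum F1."""
    return max(sweep(samples), key=lambda pair: pair[1])
```

If false negatives are the more expensive error, swap the F1 computation for recall at a fixed minimum precision; the sweep loop stays the same.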

REST surface

The same signal rides through the /v1/run REST response:

curl -X POST https://your-host/v1/run \
  -H "Authorization: Bearer $SEMVEC_LICENSE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "u-42", "message": "Our region is eu-west-1.", "response": "Got it."}'
{
  "session_id": "u-42",
  "context": "[Semvec Context …]",
  "top_similarity": 0.91,
  "short_circuit": false,
  "drift_score": 0.0,
  "drift_detected": false,
  "drift_phase": "stable",
  "dedup_signal": {
    "is_update": true,
    "max_sim": 0.91,
    "matched_id": "019e3f65-dd29-7d90-b501-c000bfe6c0df"
  }
}

dedup_signal is optional in the response: it is null when no state update happened on the call (e.g. a /v1/run without a response field, which doesn't trigger a store).
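A client therefore has to treat a null dedup_signal as "no store happened" rather than "new fact". A small sketch of that handling, using only the response fields shown above; the helper names are hypothetical:

```python
import json
from typing import Optional


def extract_dedup_signal(response_body: str) -> Optional[dict]:
    """Return the dedup_signal dict, or None if it is null or absent."""
    payload = json.loads(response_body)
    return payload.get("dedup_signal")


def should_ingest(response_body: str) -> bool:
    """True only when a store happened and the input was not an update."""
    sig = extract_dedup_signal(response_body)
    if sig is None:
        # No state update on this call, so there is nothing to ingest.
        return False
    return not sig["is_update"]
```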

Cold-start, edge cases, what None means

| Situation | dedup_signal contents |
| --- | --- |
| Very first update on a fresh state, with no prior memories. | {is_update: False, max_sim: 0.0, matched_id: None} |
| Input failed the isolation filter or NaN guard. | {is_update: False, max_sim: 0.0, matched_id: None} (you also get filtered: true in the same dict). |
| Storage is non-empty, but every prior memory was just evicted out of the retrieval set. | Same cold-start placeholder. |
| /v1/run call without a response field. | Top-level dedup_signal is null in the REST response. |

If your downstream logic needs to distinguish "cold start" from "low-sim insert," check whether matched_id is None. That is the only definitive cold-start signal: max_sim == 0.0 can also come from an embedding that is genuinely orthogonal to every stored memory.
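One way to encode these cases is a small classifier over the signal dict. The sketch below assumes the filtered flag sits inside the signal dict itself, as the table suggests; if your version places it on the enclosing update result, read it from there instead. The function name and the returned labels are illustrative.

```python
def classify_signal(sig: dict) -> str:
    """Map a dedup_signal dict to one of four caller-side categories."""
    if sig.get("filtered"):
        # Isolation filter or NaN guard rejected the input.
        return "filtered"
    if sig["matched_id"] is None:
        # Nothing to compare against: the placeholder, not a real score.
        return "cold_start"
    if sig["is_update"]:
        return "update"
    # A real comparison happened and stayed at or below the threshold,
    # including the orthogonal case where max_sim is exactly 0.0.
    return "new"
```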

Deciding before storing — state.preview_dedup()

The dedup_signal rides along with an update() that has already stored the memory. If your pipeline needs to decide whether to call update() at all — e.g. you don't want to inflate the downstream RAG index with re-statements of facts you already have — use the read-only sibling:

sig = state.preview_dedup(embedding)
if not sig["is_update"]:
    state.update(embedding, text)
else:
    # Caller decides: skip, route to an update queue, merge into the
    # current LLM context as "user is repeating themselves", …
    pass

preview_dedup() is read-only by contract:

  • Re-uses the same retrieval / top-k path that update() would run, so the cost is one similarity sweep, with no second top-k pass.
  • Does not mutate any observable state: no interaction_count bump, no topic_switch_history push, no memory.add(), no protection-score update.
  • Does not consume the per-state rate-limit bucket — safe to poll at high frequency.
  • Accepts the same per-call dedup_threshold= override as update().
  • Returns the same dict shape as the dedup_signal key inside an update() response, so a caller's threshold choice transfers 1:1 between the two paths.
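Putting the contract together, a gate around update() might look like the sketch below. `store_if_new` is a hypothetical helper, and `FakeState` is a stand-in written only so the example runs without Semvec installed; it mimics the two method signatures described above and treats an exact repeat of a stored embedding as an update.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def store_if_new(
    state: Any,
    embedding: Sequence[float],
    text: str,
    dedup_threshold: Optional[float] = None,
) -> Dict[str, Any]:
    """Call update() only when preview_dedup() reports the input as new."""
    # Forward the override only when given, so the config default applies otherwise.
    kwargs = {} if dedup_threshold is None else {"dedup_threshold": dedup_threshold}
    sig = state.preview_dedup(embedding, **kwargs)
    if sig["is_update"]:
        return {"stored": False, "signal": sig}
    result = state.update(embedding, text, **kwargs)
    return {"stored": True, "signal": result["dedup_signal"]}


class FakeState:
    """Hypothetical stand-in for SemvecState, for demonstration only."""

    def __init__(self) -> None:
        self.stored: List[Tuple[List[float], str]] = []

    def preview_dedup(self, embedding, *, dedup_threshold: float = 0.85) -> dict:
        seen = any(list(embedding) == emb for emb, _ in self.stored)
        max_sim = 1.0 if seen else 0.0
        return {
            "is_update": max_sim > dedup_threshold,
            "max_sim": max_sim,
            "matched_id": "fake-id" if seen else None,
        }

    def update(self, embedding, text, *, dedup_threshold: float = 0.85) -> dict:
        sig = self.preview_dedup(embedding, dedup_threshold=dedup_threshold)
        self.stored.append((list(embedding), text))
        return {"dedup_signal": sig}


state = FakeState()
first = store_if_new(state, [1.0, 0.0], "We deploy our service in eu-west-1.")
second = store_if_new(state, [1.0, 0.0], "We deploy our service in eu-west-1.")
```

Because the same threshold is passed to both calls, the preview and the stored signal cannot disagree about which side of the threshold an input falls on.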

REST: POST /v1/dedup-check with {session_id, text, dedup_threshold?} returns the same DedupSignal payload. Authenticated like /v1/run.

curl -X POST https://your-host/v1/dedup-check \
  -H "Authorization: Bearer $SEMVEC_LICENSE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "u-42", "text": "Our region is eu-west-1."}'
{
  "is_update": true,
  "max_sim": 0.91,
  "matched_id": "019e3f65-dd29-7d90-b501-c000bfe6c0df"
}
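The same call from Python, using only the standard library. The endpoint path, body fields, and bearer header come from the curl example above; the function names, the `base_url` parameter, and the timeout are assumptions for this sketch.

```python
import json
import urllib.request
from typing import Optional


def build_dedup_check_request(
    base_url: str,
    token: str,
    session_id: str,
    text: str,
    dedup_threshold: Optional[float] = None,
) -> urllib.request.Request:
    """Build the POST /v1/dedup-check request without sending it."""
    body = {"session_id": session_id, "text": text}
    if dedup_threshold is not None:
        # Optional per-call override, omitted to use the server-side default.
        body["dedup_threshold"] = dedup_threshold
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/dedup-check",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def dedup_check(req: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request and decode the DedupSignal payload."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Splitting request construction from sending keeps the payload easy to unit-test without a running server.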

What the signal does not do

  • No automatic insert suppression on update(). update() itself remains append-only — the embedded dedup_signal is a flag, not an action. Pair it with preview_dedup() (see above) if you want to gate the store. For "never resurface in retrieval" semantics, see Correcting memories → mechanism #3 (NegativeAttractor).
  • No contradiction detection. "I work at A" → "I work at B" produces a high max_sim, but the signal does not know that one fact replaces the other. That is a caller-side concern; combine it with mechanism #4 (Source/Confidence meta) or your own application logic.
  • No paraphrase-vs-correction distinction. High max_sim means the embeddings look alike, nothing more.
  • No cross-RAG check. The signal only sees memories inside this Semvec state. If your downstream RAG also has the fact, Semvec doesn't know — your application has to wire that cross-check itself.

If you need any of these, layer your own logic on top of the signal. The contract is intentionally narrow so it stays fast and deterministic.
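As one example of such layering, the sketch below separates verbatim repeats from high-similarity inputs whose text differs, and sends the latter to a review step where your own contradiction logic (or a human) can decide. It is a deliberately naive heuristic, not a Semvec feature: `route` and its labels are hypothetical, and `matched_text` is assumed to be the stored text you looked up yourself via matched_id.

```python
import re
from typing import Optional


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def route(sig: dict, text: str, matched_text: Optional[str]) -> str:
    """Return 'ingest', 'skip', or 'review' for an incoming statement."""
    if not sig["is_update"] or matched_text is None:
        # Below threshold, or nothing to compare against: treat as new.
        return "ingest"
    if _normalize(text) == _normalize(matched_text):
        # Same statement again: safe to drop.
        return "skip"
    # Similar embedding but different wording: could be a paraphrase or a
    # correction ("I work at A" vs. "I work at B"). The signal cannot tell.
    return "review"
```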

Composition with other Semvec features

The signal is orthogonal to everything else and composes cleanly:

  • Retrieval boost (Recency, Anchor, Trigger): unaffected — the signal is computed before / alongside retrieval scoring, and the retrieved memories themselves are not filtered by it.
  • NegativeAttractor: a memory tagged as a negative attractor can still be the matched_id in a signal — the signal reports what the embedder sees, not what retrieval would surface.
  • Compliance pack (event-store / hard delete): orthogonal. Deleted memories disappear from the retrieval set entirely, so they cannot be the matched_id.
  • Topic-switch detection (Audit-5): independent. A new topic produces a low max_sim against prior memories — the signal will naturally report is_update=False, no special-casing.

Cross-references