In-Process Library (No Server)¶

Semvec ships as a pure Python library — you don't need the REST API server or any external service. Construct a SemvecSession directly, bring your own embedder, and drive the full per-turn loop in your application process.

This page covers:

When to use in-process vs. REST API
SemvecSession — the library facade
Bringing your own embedder
Running a single turn (run_sync(), await run())
Reading results (TurnResult)
Lower-level APIs (state updates, retrieval, triggers, context)

When to use in-process¶

Scenario	Use	Reason
Tight latency requirement	In-process	No network hop. State stays in memory.
Single-process Python app	In-process	Simplest setup. No separate daemon.
Multi-agent in same process	In-process Cortex	Built-in coordination, no REST.
Polyglot integrations (Node, Go, Rust)	REST API	Language-agnostic endpoints.
Distributed multi-machine	REST API	Shared state across deployments.
Serverless / function-as-service	In-process	Ephemeral state per invocation OK.

For the full decision tree, see Choose your path.

Quick start¶

1. Install semvec and an embedder¶

pip install semvec sentence-transformers

2. Construct a session¶

from semvec import SemvecSession, SemvecState, SemvecConfig
from sentence_transformers import SentenceTransformer
import numpy as np

# Create your embedder
class STEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(name)
        self._dim = int(self._model.get_sentence_embedding_dimension() or 384)

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._model.encode(text, normalize_embeddings=True, convert_to_numpy=True)
        return np.asarray(vec, dtype=np.float64)

# Create session
embedder = STEmbedder(name="all-MiniLM-L6-v2")
config = SemvecConfig(dimension=384)
session = SemvecSession(
    pss_state=SemvecState(config=config),
    embedder=embedder,
    config=config,
)

3. Run one turn (synchronous)¶

# Synchronous — call from normal Python code
result = session.run_sync("How does the deploy pipeline work?")

print(f"Context: {result.context}")
print(f"Drift phase: {result.drift_phase}")
print(f"Short-circuit: {result.short_circuit}")

4. Run one turn (async)¶

import asyncio

async def chat():
    # Async — call from async code
    result = await session.run(
        message="How does the deploy pipeline work?",
        response="Here's what I know: [...]",  # optional previous LLM response
    )
    return result

# From sync code:
result = asyncio.run(chat())

SemvecSession API overview¶

Constructor¶

signature — keyword-only args follow the *

# Constructor signature: from semvec import SemvecSession
def __init__(
    self,
    pss_state: SemvecState,
    embedder: Any,  # your BYOE
    config: SemvecConfig,
    *,
    use_cortex: bool = False,
    chat_proxy: Any = None,
    pending_message: str | None = None,
    owner_subject: str | None = None,
    enable_bm25: bool | None = None,
    bm25_rebuild_every: int | None = None,
    auto_extract: bool | None = None,
    auto_extract_broad: bool | None = None,
    auto_anchor_from_extract: bool | None = None,
) -> None

Parameters:

pss_state — the underlying semantic state (from SemvecState(config=...))
embedder — your embedder, matching EmbedderProtocol (below)
config — the SemvecConfig instance
use_cortex — enable cross-session Cortex aggregation (multi-embedder blend); default False
enable_bm25 — enable BM25 hybrid retrieval (lexical boost); default auto from env
owner_subject — optional subject ID for provenance tracking (e.g., in compliance pack)
bm25_rebuild_every — rebuild the BM25 index at most every N writes (in-process equivalent of SEMVEC_BM25_REBUILD_EVERY); None defers to that env var (default 64)
auto_extract — extract verbatim facts (numbers, dates, identifiers) into the literal cache on every write, so they survive embedding lossiness and stay grep-retrievable. In-process equivalent of SEMVEC_AUTO_EXTRACT. None defers to the env var (default off); an explicit True/False overrides the env
auto_extract_broad — also extract broader surface tokens (tech names, version strings); in-process equivalent of SEMVEC_AUTO_EXTRACT_BROAD. None defers to the env var; explicit value overrides. Only takes effect when auto_extract is on
auto_anchor_from_extract — additionally register each extracted token as a drift anchor (boosts scores of nearby memories); in-process equivalent of SEMVEC_AUTO_ANCHOR_FROM_EXTRACT. None defers to the env var; explicit value overrides. Requires an embedder and auto_extract_broad

Parameter always wins over the environment variable

Every None-defaulted knob above follows the same rule: None (the default) reads the matching SEMVEC_* environment variable so REST and in-process behave identically; passing an explicit value overrides the env var for that session. So auto_extract=False disables extraction even when SEMVEC_AUTO_EXTRACT=1 is set in the environment.
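
The precedence rule can be pictured with a small resolver. This is a minimal sketch, not semvec's actual implementation: `resolve_bool` is a hypothetical helper, and the set of strings treated as "on" (`1`, `true`, `yes`, `on`) is an assumption about how the env var is parsed.

```python
import os
from typing import Optional


def resolve_bool(explicit: Optional[bool], env_name: str, default: bool = False) -> bool:
    """Return the explicit value if given, else the env var, else the default."""
    if explicit is not None:
        return explicit  # an explicit True/False always wins
    raw = os.environ.get(env_name)
    if raw is None:
        return default  # neither the parameter nor the env var is set
    # Assumption: these spellings count as "on"; semvec's parser may differ.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["SEMVEC_AUTO_EXTRACT"] = "1"
from_env = resolve_bool(None, "SEMVEC_AUTO_EXTRACT")      # None defers to the env var
overridden = resolve_bool(False, "SEMVEC_AUTO_EXTRACT")   # explicit False wins
print(from_env, overridden)
```

The three-state parameter (`None`, `True`, `False`) is what makes the override possible: a plain `bool` default could not distinguish "not set" from "set to False".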

The turn loop: `run()` and `run_sync()`¶

Every parameter except message is keyword-only (the * below):

signature — keyword-only args follow the *

# Async signature
async def run(
    self,
    message: str,
    *,
    response: str | None = None,  # previous LLM response (optional)
    top_k: int = 5,
    short_circuit_threshold: float = 0.85,
    mmr_fetch_k: int = 0,
    mmr_lambda: float = 0.5,
    bm25_fetch_k: int = 50,
    rrf_k: int | None = None,
    rrf_weights: list[float] | None = None,
    reranker: Callable[[str, list], list] | None = None,
) -> TurnResult

# Sync wrapper — same signature; cannot be called from within a running event loop
def run_sync(self, message: str, *, response=None, ...) -> TurnResult

# Usage
result = await session.run("the user's message", response="the previous answer")
result = session.run_sync("the user's message")
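
Because run_sync() cannot be called from within a running event loop, it can help to check which situation you are in before choosing between the two methods. The sketch below uses only the standard library; `in_running_loop` and `probe` are hypothetical helper names, not part of semvec.

```python
import asyncio


def in_running_loop() -> bool:
    """True when called from code that an event loop is currently driving."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:  # raised when no loop is running in this thread
        return False
    return True


async def probe() -> bool:
    # Inside a coroutine a loop is always running.
    return in_running_loop()


outside = in_running_loop()     # plain sync code: run_sync() is usable here
inside = asyncio.run(probe())   # inside a coroutine: use `await session.run(...)`
print(outside, inside)
```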

Both methods perform the same per-turn orchestration:

Embed the new message and optional response in parallel
If response is given, store it immediately (with its precomputed embedding)
Retrieve the top relevant memories (with optional BM25 fusion and reranking)
Compute short-circuit (is this query a near-duplicate of a stored memory?)
Compute drift (how far has semantic context shifted?)
Buffer the new message for the next turn
Render a context block (retrieval-based summary)
Return a TurnResult

Parameters:

message — the new user input
response — the LLM's previous output (optional; stored if provided)
top_k — how many memories to retrieve
short_circuit_threshold — cosine cutoff for "this is a duplicate query"
mmr_fetch_k — when >0, fetch this many candidates and apply MMR (diversity reranking)
mmr_lambda — MMR balance between relevance and diversity; in the standard MMR formulation 1.0 = pure relevance and 0.0 = pure diversity
bm25_fetch_k — how many BM25 (lexical) hits to fuse with dense results (needs enable_bm25=True + pip install "semvec[hybrid]")
rrf_k — Reciprocal-Rank-Fusion constant used when fusing dense + BM25 rankings (in-process equivalent of SEMVEC_RRF_K); None defers to that env var / the built-in default. Only relevant when BM25 is active
rrf_weights — per-ranking weights for RRF fusion, e.g. [1.0, 0.3] (dense, lexical); in-process equivalent of SEMVEC_RRF_WEIGHTS. None defers to the env var. Only relevant when BM25 is active
reranker — optional cross-encoder reranker: (query_text, candidates) -> candidates (reordered). Build one with make_cross_encoder_reranker — see below
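
To make rrf_k and rrf_weights concrete, here is a minimal sketch of the textbook Reciprocal Rank Fusion formula, score(d) = Σ wᵢ / (k + rankᵢ(d)). It is not semvec's implementation: `rrf_fuse` is a hypothetical function, the memory ids are invented, and k=60 is the value conventionally used in the RRF literature, which may or may not be semvec's built-in default.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, weights=None):
    """Fuse ranked id lists with weighted RRF; ranks are counted from 1."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["m1", "m2", "m3"]     # ranking from dense cosine retrieval
lexical = ["m3", "m1", "m4"]   # ranking from BM25
fused = rrf_fuse([dense, lexical], k=60, weights=[1.0, 0.3])
print(fused)
```

In this toy input `m3` overtakes `m2` because it appears in both rankings, even though the lexical ranking only carries a weight of 0.3.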

Read the result: `TurnResult`¶

run() / run_sync() return a TurnResult — a NamedTuple with eight fields. Only context is required for the basic loop; the rest are signals you can act on.

snippet — reading TurnResult fields; `session` is a live SemvecSession

from semvec import TurnResult

result: TurnResult = session.run_sync("Which markets does iDEAL unlock for us?")

print(result.context)             # str: constant-cost block for the LLM prompt
print(result.top_similarity)      # float: cosine of query vs. top memory
print(result.short_circuit)       # bool: query ≈ an already-answered one?
print(result.drift_score)         # float: 0.0–1.0 how far context shifted
print(result.drift_detected)      # bool: drift_score >= 0.5?
print(result.drift_phase)         # str: "stable" | "shifting" | "drifted"
print(result.dedup_signal)        # dict | None: was the stored answer an update?
print(result.retrieval_error)     # bool: did retrieval fault this turn?

What each field means (example values are the real turn-2 output of the full loop below, captured against all-MiniLM-L6-v2):

Field	Type	Example	Meaning
`context`	`str`	`[Semvec Context | Turn 1 | 1 memories]…`	The constant-cost block you inject into the LLM prompt. The only field the basic loop needs.
`top_similarity`	`float`	`0.2809`	Cosine similarity of the query to the best-matching memory. High → strong hit in memory. Also drives `short_circuit`.
`short_circuit`	`bool`	`False`	`True` when the query is effectively a duplicate of an already-answered one (`top_similarity` at or above `short_circuit_threshold`, default 0.85); you may then serve the cached answer and skip the LLM call.
`drift_score`	`float`	`0.3595`	How far the semantic context has shifted this turn, `0.0` (no shift) … `1.0` (maximal).
`drift_detected`	`bool`	`False`	Shortcut for `drift_score >= 0.5` — a hard topic change.
`drift_phase`	`str`	`"shifting"`	Coarse drift bucket: `"stable"` (< 0.3), `"shifting"` (0.3–0.5), `"drifted"` (≥ 0.5). Turn 2 shifted (new iDEAL-markets topic); turn 3 was `"stable"` again.
`dedup_signal`	`dict | None`	`{"is_update": False, "max_sim": 0.0, "matched_id": None}`	Informational hint about whether the stored `response=` was a near-duplicate of an existing memory (`is_update`), its similarity (`max_sim`) and the matched memory id. Never changes what is stored.
`retrieval_error`	`bool`	`False`	Diagnostic: `True` if per-turn retrieval faulted and the turn fell back to an empty context. The turn still succeeds.
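
One way to act on these signals is sketched below. It assumes nothing about semvec beyond the thresholds in the table: `TurnSignals` is a hypothetical stand-in for the three TurnResult fields it uses, and the action names returned by `plan_turn` are purely illustrative.

```python
from typing import NamedTuple


class TurnSignals(NamedTuple):
    """Hypothetical stand-in for the TurnResult fields used below."""
    short_circuit: bool
    drift_score: float
    retrieval_error: bool


def drift_phase(drift_score: float) -> str:
    """Bucket a drift score using the thresholds from the table."""
    if drift_score < 0.3:
        return "stable"
    if drift_score < 0.5:
        return "shifting"
    return "drifted"


def plan_turn(signals: TurnSignals) -> str:
    """Pick a next step; the action names are illustrative only."""
    if signals.retrieval_error:
        return "call_llm_with_empty_context"  # retrieval faulted this turn
    if signals.short_circuit:
        return "serve_cached_answer"          # near-duplicate of an answered query
    return "call_llm"


turn_2 = TurnSignals(short_circuit=False, drift_score=0.3595, retrieval_error=False)
print(drift_phase(turn_2.drift_score), plan_turn(turn_2))
```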

The full LLM loop¶

Semvec sits between your user prompt and your LLM. Per turn you do exactly two things: run() hands you a constant-cost context block to put in the LLM prompt, and you feed the LLM's answer back into the next run() via response=, so it lands in memory.

User prompt
    │
    ▼
session.run_sync(message=prompt, response=<previous answer>)
    │   1. stores the previous LLM answer         (response=)
    │   2. retrieves relevant memories for the new prompt
    │   3. builds a constant-cost context block
    ▼
result.context  ──►  LLM prompt  =  context  +  user prompt
    │
    ▼
LLM  ──►  answer
    │
    └──►  handed to the NEXT run() as response=

result.context stays the same size whether it is turn 3 or turn 30 000 — you never resend the growing transcript.

snippet — call_llm() is your LLM; embedder/config/session set up as above

prev_answer = None                       # turn 1 has no previous answer yet
while True:
    user_prompt = input("You: ")

    # 1. prompt -> semvec: store previous answer + get context
    result = session.run_sync(user_prompt, response=prev_answer)

    # 2. semvec -> LLM prompt: constant-cost context + new prompt
    system_prompt = "Conversation memory:\n" + result.context
    answer = call_llm(system=system_prompt, user=user_prompt)
    print("Assistant:", answer)

    # 3. remember the answer for the NEXT run() -> it gets injected there
    prev_answer = answer

Why the answer goes in on the next run(): run() stores response= before retrieval, and it buffers the current message. So when you pass turn 1's answer into turn 2's run(), semvec pairs it with turn 1's buffered question and stores a Q: … A: … memory. You only know the answer to turn 1 after the LLM replied — hence it rides along with turn 2's call.
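
The pairing can be modelled with a toy buffer. This is a sketch of the behaviour described above, not semvec's internals: `ToyTurnBuffer` is hypothetical, retrieval is reduced to "return everything stored so far", and storing an unpaired response as-is is an assumption.

```python
class ToyTurnBuffer:
    """Toy model of the store-then-buffer order; not semvec's internals."""

    def __init__(self):
        self.pending_message = None   # question still waiting for its answer
        self.memories = []            # stored "Q: ... A: ..." strings

    def run(self, message, response=None):
        # 1. Store the previous answer first, paired with the buffered question.
        if response is not None:
            if self.pending_message is not None:
                self.memories.append(f"Q: {self.pending_message} A: {response}")
            else:
                self.memories.append(response)  # assumption: stored unpaired
        # 2. "Retrieve": here simply everything stored so far.
        visible = list(self.memories)
        # 3. Buffer the new message so the next call can pair it.
        self.pending_message = message
        return visible


loop = ToyTurnBuffer()
seen_1 = loop.run("q1")                  # nothing stored yet
seen_2 = loop.run("q2", response="a1")   # a1 is paired with the buffered q1
seen_3 = loop.run("q3", response="a2")   # a2 is paired with the buffered q2
print(seen_3)
```

The memory counts per turn (0, 1, 2) follow the same pattern as the `[Semvec Context | Turn N | N memories]` headers in a three-turn run.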

Running a scripted three-turn conversation through this loop prints the real TurnResult values below (captured against all-MiniLM-L6-v2):

--- Turn 1: run_sync("We currently support SEPA payments but not iDEAL.", response=None)
top_similarity 0.0     drift_phase stable
context: [Semvec Context | Turn 0 | 0 memories]          # nothing stored yet

--- Turn 2: run_sync("Which markets does iDEAL unlock for us?", response=<turn-1 answer>)
top_similarity 0.2809  drift_phase shifting
context: [Semvec Context | Turn 1 | 1 memories]
Relevant context:
  1. [1.00] Q: We currently support SEPA payments but not iDEAL.
            A: Understood: you support SEPA … not yet integrated.

--- Turn 3: run_sync("Remind me which payment method we already support.", response=<turn-2 answer>)
top_similarity 0.4139  drift_phase stable
context: [Semvec Context | Turn 2 | 2 memories]
Relevant context:
  1. [1.00] Q: We currently support SEPA payments but not iDEAL. A: Understood: you support SEPA …
  2. [0.81] Q: Which markets does iDEAL unlock for us?          A: iDEAL is the leading … Dutch market.

The point is turn 3: the new question pulls the SEPA memory back up (top_similarity 0.4139) and your LLM answers from result.context — no transcript resent.

Retrieval quality: cosine vs. hybrid + rerank¶

There are two in-process retrieval paths, and they are not equivalent — this is the single most common cause of "recall is worse than the benchmark suggests":

Path	Retrieval	Gets BM25?	Gets rerank?
`SemvecStateSerializer().serialize(state, query_text=…)` (low-level)	dense cosine kNN only	❌	❌
`SemvecSession.run(…)` / `run_sync(…)` (the turn facade)	dense cosine + optional BM25 fusion + optional cross-encoder rerank + optional MMR	✅ (opt-in)	✅ (opt-in)

If you call the serializer directly (or state.update() + serialize) you get dense cosine only. The BM25-hybrid and cross-encoder-rerank stages — the ones the documented LOCOMO numbers were produced with — live in SemvecSession.run(). Bare cosine will not match those accuracy numbers; hit ordering is the lever, and that is what the reranker supplies.

To get the full retrieval quality in-process, use run() with a reranker and BM25 enabled. The reranker is an injected callable (no environment variable needed) — build it with make_cross_encoder_reranker:

from semvec import (
    SemvecSession, SemvecState, SemvecConfig, make_cross_encoder_reranker,
)

# 1. Build the cross-encoder reranker once (loads the model on construction).
#    Default model = the one the benchmark numbers used.
rerank = make_cross_encoder_reranker("cross-encoder/ms-marco-MiniLM-L-6-v2")

# 2. Enable BM25 on the session (needs `pip install "semvec[hybrid]"`, which
#    pulls bm25s + nltk; without it the BM25 stage silently no-ops).
config  = SemvecConfig(dimension=384)
session = SemvecSession(
    SemvecState(config=config), embedder, config, enable_bm25=True,
)

# 3. Pass the reranker per turn. bm25_fetch_k widens the lexical candidate pool.
result = session.run_sync(
    "Which markets does iDEAL unlock?",
    reranker=rerank,
    bm25_fetch_k=50,
)
context = result.context   # the same context block /v1/run returns

make_cross_encoder_reranker(model_name=…, *, fp16=False, batch_size=64, threads=None) returns a (query_text, candidates) -> candidates callable that reorders the retrieved candidates by joint (query, candidate.text) relevance — the exact precision stage the REST /v1/run path applies via SEMVEC_RERANK_MODEL. It requires sentence-transformers; an empty candidate list or a scoring error falls back to the input order rather than raising.
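
Any callable with the same `(query_text, candidates) -> candidates` shape can be passed as reranker=. The sketch below is a dependency-free toy that scores by token overlap, useful for tests where loading a cross-encoder is too heavy. `Candidate` is a hypothetical stand-in: the only thing taken from the description above is that candidates expose `.text`.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in; only the .text attribute comes from the docs."""
    text: str


def overlap_reranker(query_text, candidates):
    """Reorder candidates by the number of query tokens they share (toy scorer)."""
    if not candidates:
        return candidates
    try:
        query_tokens = set(query_text.lower().split())
        scored = [
            (len(query_tokens & set(c.text.lower().split())), index, c)
            for index, c in enumerate(candidates)
        ]
        # Best score first; the original index keeps ties in input order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [c for _, _, c in scored]
    except Exception:
        return candidates  # same fallback idea: keep the input order


cands = [
    Candidate("Cache TTL is 300 seconds."),
    Candidate("The deploy pipeline uses GitHub Actions."),
]
reranked = overlap_reranker("how does the deploy pipeline work", cands)
print([c.text for c in reranked])
```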

REST parity

Over REST the same two stages are turned on with environment variables: SEMVEC_HYBRID_BM25=1 (+ pip install "semvec[hybrid]") and SEMVEC_RERANK_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2. See the embedders guide for the full env-var set.

REST environment variable → library parameter map¶

Over REST every tuning knob is a SEMVEC_* environment variable read by the API layer and forwarded into the turn. In-process you set the same knobs as explicit parameters; nothing is REST-only. This table is the complete map for the retrieval/ingest knobs. For each knob whose library default is None, leaving it at None reads the env var so REST and in-process match, and an explicit value overrides the env. The knobs that the run() signature shows with a concrete default (top_k, mmr_fetch_k, mmr_lambda, bm25_fetch_k) take the value you pass.

REST env var	Library equivalent	Where
`SEMVEC_RUN_TOP_K`	`run(top_k=…)`	`SemvecSession.run()`
`SEMVEC_MMR_FETCH_K`	`run(mmr_fetch_k=…)`	`SemvecSession.run()`
`SEMVEC_MMR_LAMBDA`	`run(mmr_lambda=…)`	`SemvecSession.run()`
`SEMVEC_BM25_FETCH_K`	`run(bm25_fetch_k=…)`	`SemvecSession.run()`
`SEMVEC_RRF_K`	`run(rrf_k=…)`	`SemvecSession.run()`
`SEMVEC_RRF_WEIGHTS`	`run(rrf_weights=…)`	`SemvecSession.run()`
`SEMVEC_HYBRID_BM25`	`SemvecSession(enable_bm25=True)`	constructor
`SEMVEC_BM25_REBUILD_EVERY`	`SemvecSession(bm25_rebuild_every=…)`	constructor
`SEMVEC_RERANK_MODEL`	`run(reranker=make_cross_encoder_reranker(…))`	injected callable
`SEMVEC_AUTO_EXTRACT`	`SemvecSession(auto_extract=…)`	constructor
`SEMVEC_AUTO_EXTRACT_BROAD`	`SemvecSession(auto_extract_broad=…)`	constructor
`SEMVEC_AUTO_ANCHOR_FROM_EXTRACT`	`SemvecSession(auto_anchor_from_extract=…)`	constructor
`SEMVEC_CONTEXT_BUDGET_CHARS`	`SerializerConfig(context_budget_chars=…)`	serializer config

This is the whole point of the library facade

If you find a REST behaviour you cannot reproduce in-process, that is a bug — please report it. Every documented SEMVEC_* retrieval/ingest knob has a first-class parameter here.

Bringing your own embedder¶

Your embedder must implement EmbedderProtocol:

signature — the BYOE structural type you implement

from typing import Protocol
import numpy as np


class EmbedderProtocol(Protocol):
    """Structural type for a Bring-Your-Own-Embedder."""

    def get_dimension(self) -> int: ...

    def get_embedding(self, text: str) -> np.ndarray: ...

Rules:

Return normalized unit vectors (norm = 1.0) — if not, normalize in your wrapper
Return np.ndarray with dtype=np.float64
Return a zero-norm vector (np.zeros(dim)) for empty/whitespace input, not an error
The dimension must match your SemvecConfig(dimension=...)
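
The rules above are easy to check mechanically before wiring an embedder into a session. The following is a minimal sketch: `check_embedder` is a hypothetical helper, and `HashEmbedder` is a deterministic toy whose vectors carry no semantic meaning, so it is only suitable for offline tests.

```python
import hashlib

import numpy as np


class HashEmbedder:
    """Deterministic toy embedder for tests; vectors have no semantic meaning."""

    def __init__(self, dim: int = 8):
        self._dim = dim

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        # Seed a generator from the text so the same text maps to the same vector.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
        vec = rng.standard_normal(self._dim)
        return vec / np.linalg.norm(vec)


def check_embedder(embedder, expected_dim: int) -> None:
    """Assert the four rules listed above; raises AssertionError on violation."""
    assert embedder.get_dimension() == expected_dim
    vec = embedder.get_embedding("hello world")
    assert isinstance(vec, np.ndarray) and vec.dtype == np.float64
    assert vec.shape == (expected_dim,)
    assert abs(np.linalg.norm(vec) - 1.0) < 1e-6   # unit norm
    empty = embedder.get_embedding("   ")
    assert empty.shape == (expected_dim,) and not empty.any()  # zeros, no error


check_embedder(HashEmbedder(dim=8), expected_dim=8)
```

Pass `expected_dim` from the same value you give to SemvecConfig(dimension=...), so a mismatch fails here rather than inside a turn.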

Example: SentenceTransformers¶

from sentence_transformers import SentenceTransformer
import numpy as np

class STEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(name)
        self._dim = int(self._model.get_sentence_embedding_dimension() or 384)

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._model.encode(
            text,
            normalize_embeddings=True,
            convert_to_numpy=True,
            show_progress_bar=False,
        )
        return np.asarray(vec, dtype=np.float64)

Example: OpenAI¶

Needs a key — not part of the offline example suite

This wrapper calls the OpenAI API, so it requires OPENAI_API_KEY and a network connection; it is not among the examples live-tested against the wheel. get_embedding() returns a normalized float64 np.ndarray of length get_dimension() (1536 for text-embedding-3-small).

from openai import OpenAI
import numpy as np

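class OpenAIEmbedder:
    # Native output sizes of the OpenAI embedding models
    _DIMS = {
        "text-embedding-3-small": 1536,
        "text-embedding-3-large": 3072,
        "text-embedding-ada-002": 1536,
    }

    def __init__(self, model: str = "text-embedding-3-small"):
        self._client = OpenAI()
        self._model = model
        self._dim = self._DIMS[model]  # KeyError on an unknown model: add it to _DIMS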

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        response = self._client.embeddings.create(model=self._model, input=text)
        vec = np.asarray(response.data[0].embedding, dtype=np.float64)
        # OpenAI documents these embeddings as unit-length; re-normalize defensively:
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec

For more examples and tradeoffs, see Choosing an Embedder.

Lower-level APIs¶

Once you have a SemvecSession, you can drive the full API directly without run().

Store a Q&A pair (without running the full turn loop)¶

snippet — sync + async variants; `session` is a live SemvecSession

# Store a single chunk (e.g., a RAG context block the LLM didn't generate)
result = session.store_qa(
    response="The deploy pipeline uses GitHub Actions for CI/CD.",
)
# result is the update-metrics dict, e.g.:
# {'similarity': 0.1748, 'beta': 0.0802, 'pattern_strength': 1.0028,
#  'fsm': 0.0, 'phase': 'initialization', 'norm': 1.2, 'topic_switch': 0.0,
#  'novelty_score': 0.8879,
#  'dedup_signal': {'is_update': False, 'max_sim': 0.1735, 'matched_id': '019f…'}}

# Or async:
result = await session.store_qa_async(response="...")

store_qa() returns the same per-write metrics dict that state.update() yields: phase (conversation stage), fsm (stability score in [0, 1]), novelty_score, and the dedup_signal hint. matched_id is a per-memory UUID and varies per run.

Compute short-circuit (duplicate detection)¶

import numpy as np

query_embedding = embedder.get_embedding("How does deploy work?")
top_similarity, is_duplicate = session.compute_short_circuit(
    "How does deploy work?",
    threshold=0.85,
    query_embedding=query_embedding,
)
# → (0.6043, False)   # returns (top_similarity: float, is_duplicate: bool)
#   is_duplicate is True only when top_similarity >= threshold (here 0.85).

if is_duplicate:
    print("This query is too similar to a stored memory")
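
Conceptually the check is a cosine comparison against the stored vectors followed by a threshold test. The sketch below shows that idea in plain Python on two-dimensional toy vectors; it is not how semvec computes it internally, and `cosine` and `short_circuit` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # zero vectors (e.g. empty input) match nothing
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def short_circuit(query, stored, threshold=0.85):
    """Return (top_similarity, is_duplicate) over the stored vectors."""
    top = max((cosine(query, vec) for vec in stored), default=0.0)
    return top, top >= threshold


stored = [[1.0, 0.0], [0.6, 0.8]]
top, is_dup = short_circuit([0.0, 1.0], stored)        # related, but not a duplicate
top_same, is_dup_same = short_circuit([0.6, 0.8], stored)  # matches a stored vector
print(round(top, 4), is_dup, is_dup_same)
```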

Compute drift (semantic divergence)¶

query_embedding = embedder.get_embedding("What's the new framework?")
drift_score, drift_detected, drift_phase = session.compute_drift(
    "What's the new framework?",
    query_embedding=query_embedding,
)
# → (0.4690, False, 'shifting')
#   (drift_score: float, drift_detected: bool = score >= 0.5, drift_phase: str)

# drift_phase is one of: "stable", "shifting", "drifted"
print(f"Drift phase: {drift_phase}")

Retrieve top memories manually¶

query_embedding = embedder.get_embedding("What do we know about X?")
top_k = 5

memories = session.state.memory.get_relevant_memories(query_embedding, top_k=top_k)
for mem in memories:
    print(f"  [{mem.importance:.3f}] {mem.text[:100]}")
# Prints (memory objects expose .importance and .text), e.g.:
#   [1.000] The deploy pipeline uses GitHub Actions for CI/CD.
#   [0.978] Production database is PostgreSQL 15.
#   [0.938] Cache TTL is 300 seconds.

Get the context block (for injecting into LLM prompt)¶

context = session.context_block(
    query_text="What's the deploy pipeline?",
    top_k=5,
)
# context is a str, e.g.:
#   [Semvec Context | Turn 3 | 3 memories]
#   Relevant context:
#     1. [1.00] The deploy pipeline uses GitHub Actions for CI/CD.
#     2. [0.98] Production database is PostgreSQL 15.
#     3. [0.94] Cache TTL is 300 seconds.

# Paste `context` into your LLM system prompt:
system_prompt = f"""You are a helpful assistant.

## What we remember:
{context}

Answer the user's question based on the above context."""

Phase C operations: Triggers, Anchors, Isolation¶

Add resonance triggers (boost specific topics)¶

# Keyword trigger: boost memories about "bug" in retrieval
trigger_id = session.add_trigger(keyword="bug", threshold=0.8)
# → 1   # add_trigger returns an int trigger id (increments per trigger)

# Embedding trigger: boost memories semantically similar to this vector
trigger_embedding = embedder.get_embedding("database failures")
trigger_id = session.add_trigger(embedding=list(trigger_embedding), threshold=0.8)

# Clear all triggers
session.clear_triggers()

Drift anchors (realign semantic context)¶

# Add an anchor vector: the session will try to realign toward it
anchor_embedding = embedder.get_embedding("We're debugging production issues")
anchor_id = session.add_anchor(list(anchor_embedding))

# Query anchor drift:
scores = session.get_anchor_score()
print(f"Anchor score: {scores['anchor_score']}")
print(f"Remaining realignment: {scores['realignment_remaining']}")
# scores is a dict, e.g.:
#   {'anchor_score': 0.1401, 'anchor_count': 1,
#    'drift_threshold': 0.5, 'realignment_remaining': 0}

Input isolation (block off-topic queries)¶

# Isolation levels: "open" (off), "filter" (drop matching input),
# "quarantine" (hold for review), "lockdown" (block all updates).

# Filter out queries similar to a topic
exclusion_embedding = embedder.get_embedding("social media drama")
session.set_isolation(
    level="filter",
    exclusion_embeddings=[list(exclusion_embedding)],
    similarity_threshold=0.7,
)

# Quarantine queries that fall outside an allowed domain
allowlist_embedding = embedder.get_embedding("software development")
session.set_isolation(
    level="quarantine",
    allowlist_embeddings=[list(allowlist_embedding)],
)

# Release quarantine (if isolation blocked a message)
session.release_quarantine()

Inject synthetic memories¶

# Manually add a memory (e.g., from an external knowledge base)
embedding = embedder.get_embedding("Deployment uses Terraform for IaC")
memory_count = session.inject_memory(
    embedding=list(embedding),
    text="Deployment uses Terraform for IaC",
    tier="long_term",
    importance=0.8,
)
# → 7   # inject_memory returns the new total memory count in the session

Persistence: Export and import state¶

Export¶

export_dict = session.export_state()
# → sorted(export_dict.keys()) == ['checksum', 'state_dict']
# Contains:
# - state_dict: the full semantic state, memory tiers, history
# - checksum: SHA256 over the semantic vector for tamper detection

import json
with open("session_backup.json", "w") as f:
    json.dump(export_dict, f)
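
If the process can be interrupted mid-write, a plain json.dump() can leave a truncated backup behind. A common remedy, sketched here with the standard library only, is to write to a temporary file in the same directory and rename it into place. `save_json_atomic` is a hypothetical helper, and the dict below is a placeholder standing in for the result of session.export_state().

```python
import json
import os
import tempfile


def save_json_atomic(payload: dict, path: str) -> None:
    """Write JSON to a temp file next to `path`, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_path, path)      # rename is atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)         # do not leave a half-written temp file
        raise


# Placeholder standing in for session.export_state(); values are not real.
export_dict = {"state_dict": {"memories": []}, "checksum": "placeholder"}
backup_path = os.path.join(tempfile.mkdtemp(), "session_backup.json")
save_json_atomic(export_dict, backup_path)

with open(backup_path, encoding="utf-8") as handle:
    restored = json.load(handle)
print(sorted(restored.keys()))
```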

Import¶

import json

with open("session_backup.json", "r") as f:
    export_dict = json.load(f)

# Restore into a session (typically one you have just constructed)
session.import_state(export_dict["state_dict"])

Literal cache (verbatim code facts)¶

For coding agents and compliance workloads, store exact values (variable names, file paths, error messages) that must be returned verbatim rather than reconstructed from a lossy embedding:

# Store an entity
session.store_entity(
    key="deploy_script_path",
    kind="path",
    value="/opt/app/scripts/deploy.sh",
    context="Used in the production deploy pipeline",
    importance=1.0,
)

# Query by text
entities = session.query_entities(query_text="path", max_results=10)
for e in entities:
    print(f"  [{e['kind']}] {e['key']} = {e['value']}")

# Query all
all_entities = session.query_entities(max_results=100)

# Remove
session.remove_entity(key="deploy_script_path")

Metrics and diagnostics¶

metrics = session.get_metrics()

print(f"Phase: {metrics['phase']}")
print(f"Interactions: {metrics['interaction_count']}")
print(f"Total memories: {metrics['total_memories']}")
print(f"Beta history: {metrics['beta_history']}")
print(f"Phase history: {metrics['phase_history']}")
# metrics is a dict, e.g. (after 3 interactions):
#   {'phase': 'initialization', 'interaction_count': 3, 'total_memories': 7,
#    'beta_history': [0.0829, 0.0782, 0.0802],
#    'similarity_history': [0.0, 0.1479, 0.1748],
#    'fsm_history': [0.0, 0.0, 0.0],
#    'norm_history': [1.2, 1.2, 1.2],
#    'phase_history': []}

Complete example: in-process coding assistant¶

import asyncio
from semvec import SemvecSession, SemvecState, SemvecConfig
from sentence_transformers import SentenceTransformer
import numpy as np

class STEmbedder:
    def __init__(self):
        self._model = SentenceTransformer("all-MiniLM-L6-v2")
        self._dim = 384

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._model.encode(text, normalize_embeddings=True, convert_to_numpy=True)
        return np.asarray(vec, dtype=np.float64)

# Setup
embedder = STEmbedder()
config = SemvecConfig(dimension=384)
session = SemvecSession(
    pss_state=SemvecState(config=config),
    embedder=embedder,
    config=config,
)

# Simulate a multi-turn conversation
turns = [
    ("Can you explain the authentication flow?", "Here's the OAuth2 flow: ..."),
    ("What rate limits do we have?", "API rate limits are 1000 req/min per token."),
    ("How do we handle token refresh?", "The client library auto-refreshes 5 min before expiry."),
    ("What if the backend is down during refresh?", "We have a 30-second exponential backoff retry."),
]

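async def multi_turn_chat():
    prev_answer = None  # turn 1 has no previous answer yet
    for i, (user_msg, llm_response) in enumerate(turns):
        print(f"\n--- Turn {i+1} ---")
        # response= carries the PREVIOUS turn's answer, so semvec pairs it
        # with the previous turn's buffered question
        result = await session.run(
            message=user_msg,
            response=prev_answer,
            top_k=3,
        )
        print(f"User: {user_msg}")
        print(f"Drift phase: {result.drift_phase}")
        print(f"Retrieved context:\n{result.context}\n")
        prev_answer = llm_response  # scripted stand-in for the LLM's reply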

# Run
asyncio.run(multi_turn_chat())