Production hardening¶
Thread safety of SemvecState.update()¶
SemvecState is a Rust extension type exposed via PyO3. The wheel ships a
single _core.abi3.so and the Python layer (SessionManager) wraps each
session in a _Session dataclass with no per-instance Python-level lock.
The collection of sessions is protected by a threading.Lock inside
SessionManager (lookup, insert, evict). That lock is released before the
caller ever touches SemvecState.update(). Concurrent update() calls on
the same SemvecState instance therefore depend on whatever locking the
Rust core does internally, which is not part of the public contract in
0.6.0.
Operating rule:
Treat each
SemvecStateinstance as single-writer. Serializeupdate()calls per session at the application layer.
Recommended patterns:
# asyncio app
import asyncio
session_locks: dict[str, asyncio.Lock] = {}
async def safe_update(session_id, state, vec, text):
lock = session_locks.setdefault(session_id, asyncio.Lock())
async with lock:
return state.update(vec, text)
# threaded app — pin one state per thread, or wrap in a queue
from concurrent.futures import ThreadPoolExecutor
pool = ThreadPoolExecutor(max_workers=1) # one worker per session
The REST layer already enforces this for you: /v1/run and /v1/store
route every request for session_id=X through the same in-process
_Session, and FastAPI's per-request handler keeps the call serialized
within a single worker. Across multiple uvicorn/granian workers you
need sticky routing — see the REST reference
on SessionManager not being shared across worker processes.
update_batch() is the recommended path for bulk ingest: it accepts a
list of (embedding, text) pairs and lets the Rust core amortize the
state mutations. Do not parallelize multiple update_batch() calls
on the same state.
Graceful shutdown¶
SessionManager.shutdown(timeout=5.0) is wired into the FastAPI
lifespan and fires on SIGTERM (uvicorn / granian both honor it). The
sequence on SIGTERM is:
- Stop accepting new connections. uvicorn closes the listening socket.
- Drain in-flight requests. Active
/v1/run,/v1/storeand other handlers finish their current call. uvicorn's defaulttimeout_graceful_shutdownis unbounded — set it explicitly (--timeout-graceful-shutdown 30) to bound the wait. - Cancel the session sweeper (
semvec-session-sweeperasyncio task), await its exit. SessionManager.shutdown(timeout=5.0): drain the shared embedder (5 s budget — beyond that the embedder is dropped without waiting), thenself._sessions.clear().- Process exit.
What is lost at step 4:
- In-memory
SemvecStatefor every session that wasn't snapshotted.clear()drops the Rust handles. SQLAlchemy session/cluster/member metadata stays in the database; the hot semantic state does not. Restoring after restart requiresPOST /v1/session/{id}/importwith a snapshot taken viaGET /v1/session/{id}/exportwhile the process was still up. - Embedder batches in flight that did not complete within 5 s. The sidecar/batched embedder's pending Futures resolve with a shutdown error; the calling handler propagates a 5xx to the client.
What is preserved:
- Compliance pack event-store rows (Postgres / SQLite — already fsynced
per
POST /v1/store). - Audit chain (HMAC / RS256 entries written synchronously).
- Every SQLAlchemy row for sessions, clusters, regions, observers, members, audit events.
Operational rule: if your application cannot tolerate snapshot loss,
issue GET /v1/session/{id}/export on a checkpoint cadence
(every N turns or every M minutes) and store the result alongside your
own application state. The 0.6.0 server does not auto-snapshot.
Systemd example:
[Service]
ExecStart=/opt/semvec/.venv/bin/semvec serve --host 127.0.0.1 --port 8080
KillSignal=SIGTERM
TimeoutStopSec=60s
Restart=on-failure
Kubernetes example:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: semvec
lifecycle:
preStop:
exec:
# Stop receiving new traffic, then let SIGTERM drain in-flight.
command: ["sleep", "10"]
The preStop sleep 10 is a deliberate hack — it lets the service-mesh /
Endpoints controller drop the pod from rotation before SIGTERM fires, so
no new connection lands on a pod that's already draining.
Liveness vs readiness¶
0.6.0 ships one health endpoint: GET /v1/health (no auth, returns
status, active_sessions, version). It is suitable as a liveness
probe — it answers if the asyncio loop is alive and the SessionManager
is reachable.
It is not a readiness probe. It does not check:
- Embedder is loaded (first call lazy-loads it; a fresh worker will
return 200 on
/v1/healthwhile the next/v1/runblocks for 10–30 s on model download / GPU warm-up). - Database connection is healthy.
- Sidecar embedder (
SEMVEC_EMBEDDER_URL) is reachable. - License-verification keyset is loaded.
Workarounds until a dedicated readiness endpoint ships:
K8s exec probe — call the API once during readiness, fail if it doesn't return memories:
readinessProbe:
exec:
command:
- /bin/sh
- -c
- |
curl -fsS -H "X-API-Key: $SEMVEC_LICENSE_KEY" \
-X POST http://127.0.0.1:8080/v1/run \
-H 'Content-Type: application/json' \
-d '{"message":"readiness probe"}' \
| grep -q session_id
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 3
livenessProbe:
httpGet:
path: /v1/health
port: 8080
periodSeconds: 10
failureThreshold: 5
Pre-warm in the entrypoint — make the embedder load synchronously before uvicorn starts taking traffic:
.venv/bin/python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
.venv/bin/semvec serve --host 0.0.0.0 --port 8080
Combine both: the entrypoint pre-warm makes the first /v1/run fast,
the readiness probe protects against silent embedder failures
(out-of-disk on model download, sidecar 503, etc.).
Prometheus label cardinality and tenant leaks¶
/metrics exports three series from a private CollectorRegistry:
| Series | Type | Labels |
|---|---|---|
semvec_requests_total |
Counter | method, endpoint, status |
semvec_request_duration_seconds |
Histogram | method, endpoint |
semvec_active_sessions |
Gauge | — |
The endpoint label is the route template (e.g.
/v1/session/{session_id}), not the resolved URL — a deliberate fix
to keep cardinality bounded. UUID session/cluster/region IDs never
reach the label.
No tenant-scoped series ship. license_subject is not a Prometheus
label in 0.6.0. The Counter and Histogram aggregate across all callers.
Implication:
- A noisy tenant cannot be isolated from
/metricsalone — you need application logs (which do carry session_id) plus log-based aggregation. - Conversely, you do not need to strip tenant data from
/metricsbefore exposing it to a shared monitoring stack.
If you add custom Counters/Histograms to a fork or downstream
deployment, do not add session_id, cluster_id, region_id,
memory_hash, or license_subject as labels. Each is unbounded in
cardinality (UUIDs, BLAKE3 hashes) and will blow up the Prometheus
backend. Use exemplars or trace-IDs for per-request drilldown instead.
OpenTelemetry hooks¶
Semvec 0.6.0 does not emit OpenTelemetry traces or metrics natively.
If you need distributed tracing across your agent → /v1/run → embedder
sidecar → database hops, instrument at the FastAPI layer with
opentelemetry-instrumentation-fastapi:
pip install opentelemetry-instrumentation-fastapi \
opentelemetry-instrumentation-sqlalchemy \
opentelemetry-exporter-otlp
Install
OpenTelemetry instrumentation is opt-in: pip install
opentelemetry-instrumentation-fastapi opentelemetry-sdk
opentelemetry-exporter-otlp. semvec does not bundle OTel.
# wrapper around semvec.api:create_app
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from semvec.api import create_app
app = create_app()
FastAPIInstrumentor.instrument_app(app)
This captures HTTP-level spans (method, route, status, duration). It
does not see inside the Rust core — SemvecState.update() is a
single PyO3 call, opaque to Python-level tracing. The same is true for
sidecar embedder calls if you do not separately instrument the embedder
process.
Connection draining for the embedder sidecar¶
When SEMVEC_EMBEDDER_URL is set, the API process owns a
SidecarEmbedderClient (HTTP keep-alive pool) instead of an in-process
SentenceTransformer. Shutdown order matters:
- Drain the API first (SIGTERM
semvec serveworkers). TheSessionManager.shutdown()lifespan hook closes the embedder client inside the 5 s drain window. - Then drain the sidecar. SIGTERM
python -m semvec.embedder. It has its own batch-flush logic; the daemon will reject new HTTP submissions and finish in-flight batches before exiting.
Reversing the order (sidecar first) causes the API workers to surface
ConnectionRefusedError on every in-flight /v1/run until they finish
draining. Always shut the consumer down before the producer.
If you run sidecar and API as separate Deployments behind a Service,
add a preStop sleep on the sidecar Deployment that is longer
than terminationGracePeriodSeconds on the API Deployment, so the
sidecar outlives any API pod that's still draining.
State persistence & durability¶
The "Graceful shutdown" section above describes the 0.6.0 baseline:
in-memory state, lost on clear(), restored only by an explicit
export/import cadence you wire yourself. From 0.7.1 the server can
do that for you — durable per-session state behind one environment variable.
SEMVEC_STATE_PERSIST — durable per-session state (default OFF)¶
Set SEMVEC_STATE_PERSIST=1 and point DATABASE_URL at Postgres to make
each session's semantic state survive a worker restart. The mechanics:
- Write-behind, not write-through. A flush is not taken on every turn. The server flushes a session's state on a periodic tick and again on SIGTERM (inside the graceful-drain window). This keeps the hot path off the database — turns stay in memory and only the flush pays the write cost.
- Lazy reload. State is read back from the store on the first access to a session after a restart, not eagerly on boot. A fresh worker starts cold and hydrates a session the moment a request for it arrives.
- Recovery point (RPO). On a graceful shutdown (SIGTERM) the final flush
fires, so a session resumes bit-exact. On a hard crash (OOM kill,
SIGKILL, node loss) you can lose up to one flush interval of turns — whatever changed since the last tick. Size the flush interval to the loss you can tolerate.
State blobs are stored as compressed binary (~56% smaller than the legacy JSON encoding). The reader is backward-compatible: legacy uncompressed JSON blobs written by older deployments are still read transparently, so an in-place upgrade needs no migration step.
One owner worker per session¶
Persistence assumes a single writer per session. There is no distributed lock — instead, each flush carries an optimistic version, and the store refuses a write whose version is stale (a compare-and-set that would clobber a newer blob). That refusal is loud (logged, surfaced), not a silent last-writer-wins.
The practical consequence: route every request for a given session_id to the
same worker (sticky routing — consistent hashing at the load balancer, or
a session-affinity cookie). If two workers both own the same live session and
both flush, the CAS rejects the loser rather than corrupting the blob, but you
will see churn and lost writes on the rejected side. Sticky routing avoids the
contention entirely. This is the same single-writer rule as
Thread safety of SemvecState.update(),
extended across worker processes.
SEMVEC_STATE_DB_SHARDS — Postgres sharding (opt-in, advanced scaling)¶
For deployments where a single Postgres primary is the write-throughput
ceiling, SEMVEC_STATE_DB_SHARDS takes a comma-separated list of Postgres
DSNs and spreads the state-blob table across them:
export SEMVEC_STATE_PERSIST=1
export SEMVEC_STATE_DB_SHARDS="postgresql://u:p@shard-a/semvec,postgresql://u:p@shard-b/semvec,postgresql://u:p@shard-c/semvec"
- Sharded by
session_idvia rendezvous (HRW) hashing. Each session maps to exactly one shard. Rendezvous hashing is chosen for its minimal reshuffle property: adding a shard moves only ~1/Nof the keys (vs. a modulo scheme that remaps almost everything), so you can grow the pool with bounded rebalancing. - Metadata stays on the primary. Only the high-volume state-blob table is
sharded. The session / cluster / region / audit metadata tables continue to
live on the primary
DATABASE_URL, so ownership checks and the compliance surface are unaffected.
This is a write-throughput lever, not a write-amplification fix. Sharding spreads writes across more backends; it does not reduce how much each turn writes. The orthogonal lever for that is blob compression (above), which shrinks each write ~56%. Use both together at scale. (Ships in 0.7.1.)
Persisting state in-process / library use¶
Automatic persistence is a server convenience — it lives in the FastAPI
lifespan. A library user driving SemvecState / SemvecSession directly (no
FastAPI) persists manually, and the primitive is the same compressed blob
the server uses:
# --- on shutdown / checkpoint: serialize and store anywhere ---
blob = state.to_bytes(compress=True) # compact binary, ~56% smaller
# persist `blob` wherever you like: a file, your own DB column, an
# object store (S3/GCS), etc. — it's opaque bytes.
Path("session-42.semvec").write_bytes(blob)
# --- on startup: load it back ---
blob = Path("session-42.semvec").read_bytes()
state = SemvecState.from_bytes(blob) # auto-detects compressed vs legacy
from_bytes() auto-detects the encoding, so a blob written by an older build
(uncompressed) still loads. See
to_bytes / from_bytes in the Core API reference.
Durable cross-frontend dedup in-process is the same recipe plus one shared
session: hold a single SemvecSession, feed every frontend's turns through it,
and save/load it with to_bytes/from_bytes on the boundaries. The
matched_id in each dedup_signal is a durable handle — it survives the
to_bytes/from_bytes round-trip, so a correlation stored before the restart
still resolves afterward. See
Cross-frontend dedup → in-process.