Production hardening¶
Thread safety of SemvecState.update()¶
SemvecState is a Rust extension type exposed via PyO3. The wheel ships a
single _core.abi3.so and the Python layer (SessionManager) wraps each
session in a _Session dataclass with no per-instance Python-level lock.
The collection of sessions is protected by a threading.Lock inside
SessionManager (lookup, insert, evict). That lock is released before the
caller ever touches SemvecState.update(). Concurrent update() calls on
the same SemvecState instance therefore depend on whatever locking the
Rust core does internally, which is not part of the public contract in
0.6.0.
Operating rule:
Treat each SemvecState instance as single-writer. Serialize update() calls per session at the application layer.
Recommended patterns:
# asyncio app
import asyncio
session_locks: dict[str, asyncio.Lock] = {}
async def safe_update(session_id, state, vec, text):
    lock = session_locks.setdefault(session_id, asyncio.Lock())
    async with lock:
        return state.update(vec, text)
# threaded app — dedicate one single-worker executor per session so all
# update() calls for that session run on the same thread
from concurrent.futures import ThreadPoolExecutor
session_pools: dict[str, ThreadPoolExecutor] = {}
pool = session_pools.setdefault(session_id, ThreadPoolExecutor(max_workers=1))
pool.submit(state.update, vec, text)
The REST layer already enforces this for you: /v1/run and /v1/store
route every request for session_id=X through the same in-process
_Session, and FastAPI's per-request handler keeps the call serialized
within a single worker. Across multiple uvicorn/granian workers you
need sticky routing — see the REST reference
on SessionManager not being shared across worker processes.
update_batch() is the recommended path for bulk ingest: it accepts a
list of (embedding, text) pairs and lets the Rust core amortize the
state mutations. Do not parallelize multiple update_batch() calls
on the same state.
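Under that rule, a bulk-ingest loop stays single-writer by chunking sequentially. A minimal sketch — the chunk size and the exact shape of the pairs are assumptions, not part of the documented API:

```python
# Sketch: feed (embedding, text) pairs to update_batch() in sequential
# chunks from a single writer — never from concurrent threads or tasks.
def ingest(state, pairs, chunk_size=256):
    batches = 0
    for i in range(0, len(pairs), chunk_size):
        state.update_batch(pairs[i:i + chunk_size])
        batches += 1
    return batches
```

Chunking keeps per-call memory bounded while still letting the Rust core amortize mutations within each batch.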
Graceful shutdown¶
SessionManager.shutdown(timeout=5.0) is wired into the FastAPI
lifespan and fires on SIGTERM (uvicorn / granian both honor it). The
sequence on SIGTERM is:
1. Stop accepting new connections. uvicorn closes the listening socket.
2. Drain in-flight requests. Active /v1/run, /v1/store and other handlers finish their current call. uvicorn's default timeout_graceful_shutdown is unbounded — set it explicitly (--timeout-graceful-shutdown 30) to bound the wait.
3. Cancel the session sweeper (the semvec-session-sweeper asyncio task) and await its exit.
4. SessionManager.shutdown(timeout=5.0): drain the shared embedder (5 s budget — beyond that the embedder is dropped without waiting), then self._sessions.clear().
5. Process exit.
What is lost at step 4:
- In-memory SemvecState for every session that wasn't snapshotted. clear() drops the Rust handles. SQLAlchemy session/cluster/member metadata stays in the database; the hot semantic state does not. Restoring after restart requires POST /v1/session/{id}/import with a snapshot taken via GET /v1/session/{id}/export while the process was still up.
- Embedder batches in flight that did not complete within 5 s. The sidecar/batched embedder's pending Futures resolve with a shutdown error; the calling handler propagates a 5xx to the client.
What is preserved:
- Compliance pack event-store rows (Postgres / SQLite — already fsynced per POST /v1/store).
- Audit chain (HMAC / RS256 entries written synchronously).
- Every SQLAlchemy row for sessions, clusters, regions, observers, members, audit events.
Operational rule: if your application cannot tolerate snapshot loss,
issue GET /v1/session/{id}/export on a checkpoint cadence
(every N turns or every M minutes) and store the result alongside your
own application state. The 0.6.0 server does not auto-snapshot.
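A minimal checkpoint helper along those lines — every_n_turns and the export_fn callable are illustrative, where export_fn would wrap GET /v1/session/{id}/export in a real deployment:

```python
class Checkpointer:
    """Keep the most recent snapshot, refreshing it every N turns."""

    def __init__(self, export_fn, every_n_turns=50):
        self.export_fn = export_fn       # e.g. wraps GET /v1/session/{id}/export
        self.every_n_turns = every_n_turns
        self.turns = 0
        self.latest = None               # restore via POST /v1/session/{id}/import

    def on_turn(self):
        # call once per conversation turn; refreshes on every Nth turn
        self.turns += 1
        if self.turns % self.every_n_turns == 0:
            self.latest = self.export_fn()
        return self.latest
```

Persist `latest` alongside your own application state; after a restart, replay it through the import endpoint before routing traffic to the session.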
Systemd example:
[Service]
ExecStart=/opt/semvec/.venv/bin/semvec serve --host 127.0.0.1 --port 8080
KillSignal=SIGTERM
TimeoutStopSec=60s
Restart=on-failure
Kubernetes example:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: semvec
lifecycle:
preStop:
exec:
# Stop receiving new traffic, then let SIGTERM drain in-flight.
command: ["sleep", "10"]
The preStop sleep 10 is a deliberate hack — it lets the service-mesh /
Endpoints controller drop the pod from rotation before SIGTERM fires, so
no new connection lands on a pod that's already draining.
Liveness vs readiness¶
0.6.0 ships one health endpoint: GET /v1/health (no auth, returns
status, active_sessions, version). It is suitable as a liveness
probe — it answers if the asyncio loop is alive and the SessionManager
is reachable.
It is not a readiness probe. It does not check:
- Embedder is loaded (the first call lazy-loads it; a fresh worker will return 200 on /v1/health while the next /v1/run blocks for 10–30 s on model download / GPU warm-up).
- Database connection is healthy.
- Sidecar embedder (SEMVEC_EMBEDDER_URL) is reachable.
- License-verification keyset is loaded.
Workarounds until a dedicated readiness endpoint ships:
K8s exec probe — call the API once during readiness and fail if the response carries no session_id:
readinessProbe:
exec:
command:
- /bin/sh
- -c
- |
curl -fsS -H "X-API-Key: $SEMVEC_LICENSE_KEY" \
-X POST http://127.0.0.1:8080/v1/run \
-H 'Content-Type: application/json' \
-d '{"message":"readiness probe"}' \
| grep -q session_id
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 3
livenessProbe:
httpGet:
path: /v1/health
port: 8080
periodSeconds: 10
failureThreshold: 5
Pre-warm in the entrypoint — make the embedder load synchronously before uvicorn starts taking traffic:
.venv/bin/python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
.venv/bin/semvec serve --host 0.0.0.0 --port 8080
Combine both: the entrypoint pre-warm makes the first /v1/run fast,
the readiness probe protects against silent embedder failures
(out-of-disk on model download, sidecar 503, etc.).
Prometheus label cardinality and tenant leaks¶
/metrics exports three series from a private CollectorRegistry:
| Series | Type | Labels |
|---|---|---|
| semvec_requests_total | Counter | method, endpoint, status |
| semvec_request_duration_seconds | Histogram | method, endpoint |
| semvec_active_sessions | Gauge | — |
The endpoint label is the route template (e.g.
/v1/session/{session_id}), not the resolved URL — a deliberate fix
to keep cardinality bounded. UUID session/cluster/region IDs never
reach the label.
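If you front the service with your own middleware or gateway, apply the same discipline: collapse resolved paths back to templates before they become labels. A hypothetical stdlib-only normalizer — the UUID pattern and the placeholder name are assumptions for illustration:

```python
import re

# Collapse UUIDs in a resolved path back to a route-template placeholder
# so the endpoint label stays bounded regardless of traffic.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def endpoint_label(path: str) -> str:
    return UUID_RE.sub("{session_id}", path)
```

With this in place, every session-scoped URL maps to one label value instead of one per UUID.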
No tenant-scoped series ship. license_subject is not a Prometheus
label in 0.6.0. The Counter and Histogram aggregate across all callers.
Implication:
- A noisy tenant cannot be isolated from /metrics alone — you need application logs (which do carry session_id) plus log-based aggregation.
- Conversely, you do not need to strip tenant data from /metrics before exposing it to a shared monitoring stack.
If you add custom Counters/Histograms to a fork or downstream
deployment, do not add session_id, cluster_id, region_id,
memory_hash, or license_subject as labels. Each is unbounded in
cardinality (UUIDs, BLAKE3 hashes) and will blow up the Prometheus
backend. Use exemplars or trace-IDs for per-request drilldown instead.
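For per-request drilldown, route the unbounded identifiers into structured logs instead. A sketch — the logger name and field names are illustrative, not part of semvec:

```python
import json
import logging

log = logging.getLogger("semvec.app")

def log_request(session_id: str, endpoint: str, duration_s: float) -> str:
    # Unbounded identifiers (session IDs, hashes) belong in log lines,
    # which log-based aggregation can slice — not in metric labels.
    line = json.dumps({
        "session_id": session_id,
        "endpoint": endpoint,
        "duration_s": duration_s,
    })
    log.info(line)
    return line
```

The Prometheus series stays low-cardinality while the log pipeline retains full per-session detail.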
OpenTelemetry hooks¶
Semvec 0.6.0 does not emit OpenTelemetry traces or metrics natively.
If you need distributed tracing across your agent → /v1/run → embedder
sidecar → database hops, instrument at the FastAPI layer with
opentelemetry-instrumentation-fastapi:
pip install opentelemetry-instrumentation-fastapi \
    opentelemetry-instrumentation-sqlalchemy \
    opentelemetry-sdk \
    opentelemetry-exporter-otlp
OpenTelemetry instrumentation is opt-in; semvec does not bundle OTel.
# wrapper around semvec.api:create_app
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from semvec.api import create_app
app = create_app()
FastAPIInstrumentor.instrument_app(app)
This captures HTTP-level spans (method, route, status, duration). It
does not see inside the Rust core — SemvecState.update() is a
single PyO3 call, opaque to Python-level tracing. The same is true for
sidecar embedder calls if you do not separately instrument the embedder
process.
Connection draining for the embedder sidecar¶
When SEMVEC_EMBEDDER_URL is set, the API process owns a
SidecarEmbedderClient (HTTP keep-alive pool) instead of an in-process
SentenceTransformer. Shutdown order matters:
- Drain the API first: SIGTERM the semvec serve workers. The SessionManager.shutdown() lifespan hook closes the embedder client inside the 5 s drain window.
- Then drain the sidecar: SIGTERM python -m semvec.embedder. It has its own batch-flush logic; the daemon rejects new HTTP submissions and finishes in-flight batches before exiting.
Reversing the order (sidecar first) causes the API workers to surface
ConnectionRefusedError on every in-flight /v1/run until they finish
draining. Always shut the consumer down before the producer.
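A supervising process can encode that order directly. A sketch assuming both processes are children of the supervisor — the drain timeout is illustrative:

```python
import signal
import subprocess

def ordered_shutdown(api: subprocess.Popen, sidecar: subprocess.Popen,
                     drain_timeout: float = 30):
    # Consumer first: the API's lifespan hook drains sessions and closes
    # its embedder client while the sidecar is still reachable.
    api.send_signal(signal.SIGTERM)
    api.wait(timeout=drain_timeout)
    # Producer second: the sidecar flushes in-flight batches on SIGTERM.
    sidecar.send_signal(signal.SIGTERM)
    sidecar.wait(timeout=drain_timeout)
```

`wait(timeout=...)` raises `subprocess.TimeoutExpired` if a process fails to exit in time, so a stuck drain surfaces instead of hanging the supervisor.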
If you run sidecar and API as separate Deployments behind a Service, add a preStop sleep on the sidecar Deployment that is longer than terminationGracePeriodSeconds on the API Deployment, so the sidecar outlives any API pod that's still draining. Raise the sidecar's own terminationGracePeriodSeconds to cover that sleep — the preStop hook counts against the grace period, and the kubelet will SIGKILL the sidecar mid-flush if it runs out.