Skip to content

Production hardening

Thread safety of SemvecState.update()

SemvecState is a Rust extension type exposed via PyO3. The wheel ships a single _core.abi3.so and the Python layer (SessionManager) wraps each session in a _Session dataclass with no per-instance Python-level lock.

The collection of sessions is protected by a threading.Lock inside SessionManager (lookup, insert, evict). That lock is released before the caller ever touches SemvecState.update(). Concurrent update() calls on the same SemvecState instance therefore depend on whatever locking the Rust core does internally, which is not part of the public contract in 0.6.0.

Operating rule:

Treat each SemvecState instance as single-writer. Serialize update() calls per session at the application layer.

Recommended patterns:

# asyncio app
import asyncio
session_locks: dict[str, asyncio.Lock] = {}

async def safe_update(session_id, state, vec, text):
    lock = session_locks.setdefault(session_id, asyncio.Lock())
    async with lock:
        return state.update(vec, text)
# threaded app — pin one state per thread, or wrap in a queue
from concurrent.futures import ThreadPoolExecutor
pool = ThreadPoolExecutor(max_workers=1)  # one worker per session

The REST layer already enforces this for you: /v1/run and /v1/store route every request for session_id=X through the same in-process _Session, and FastAPI's per-request handler keeps the call serialized within a single worker. Across multiple uvicorn/granian workers you need sticky routing — see the REST reference on SessionManager not being shared across worker processes.

update_batch() is the recommended path for bulk ingest: it accepts a list of (embedding, text) pairs and lets the Rust core amortize the state mutations. Do not parallelize multiple update_batch() calls on the same state.

Graceful shutdown

SessionManager.shutdown(timeout=5.0) is wired into the FastAPI lifespan and fires on SIGTERM (uvicorn / granian both honor it). The sequence on SIGTERM is:

  1. Stop accepting new connections. uvicorn closes the listening socket.
  2. Drain in-flight requests. Active /v1/run, /v1/store and other handlers finish their current call. uvicorn's default timeout_graceful_shutdown is unbounded — set it explicitly (--timeout-graceful-shutdown 30) to bound the wait.
  3. Cancel the session sweeper (semvec-session-sweeper asyncio task), await its exit.
  4. SessionManager.shutdown(timeout=5.0): drain the shared embedder (5 s budget — beyond that the embedder is dropped without waiting), then self._sessions.clear().
  5. Process exit.

What is lost at step 4:

  • In-memory SemvecState for every session that wasn't snapshotted. clear() drops the Rust handles. SQLAlchemy session/cluster/member metadata stays in the database; the hot semantic state does not. Restoring after restart requires POST /v1/session/{id}/import with a snapshot taken via GET /v1/session/{id}/export while the process was still up.
  • Embedder batches in flight that did not complete within 5 s. The sidecar/batched embedder's pending Futures resolve with a shutdown error; the calling handler propagates a 5xx to the client.

What is preserved:

  • Compliance pack event-store rows (Postgres / SQLite — already fsynced per POST /v1/store).
  • Audit chain (HMAC / RS256 entries written synchronously).
  • Every SQLAlchemy row for sessions, clusters, regions, observers, members, audit events.

Operational rule: if your application cannot tolerate snapshot loss, issue GET /v1/session/{id}/export on a checkpoint cadence (every N turns or every M minutes) and store the result alongside your own application state. The 0.6.0 server does not auto-snapshot.

Systemd example:

[Service]
ExecStart=/opt/semvec/.venv/bin/semvec serve --host 127.0.0.1 --port 8080
KillSignal=SIGTERM
TimeoutStopSec=60s
Restart=on-failure

Kubernetes example:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: semvec
    lifecycle:
      preStop:
        exec:
          # Stop receiving new traffic, then let SIGTERM drain in-flight.
          command: ["sleep", "10"]

The preStop sleep 10 is a deliberate hack — it lets the service-mesh / Endpoints controller drop the pod from rotation before SIGTERM fires, so no new connection lands on a pod that's already draining.

Liveness vs readiness

0.6.0 ships one health endpoint: GET /v1/health (no auth, returns status, active_sessions, version). It is suitable as a liveness probe — it answers if the asyncio loop is alive and the SessionManager is reachable.

It is not a readiness probe. It does not check:

  • Embedder is loaded (first call lazy-loads it; a fresh worker will return 200 on /v1/health while the next /v1/run blocks for 10–30 s on model download / GPU warm-up).
  • Database connection is healthy.
  • Sidecar embedder (SEMVEC_EMBEDDER_URL) is reachable.
  • License-verification keyset is loaded.

Workarounds until a dedicated readiness endpoint ships:

K8s exec probe — call the API once during readiness, fail if it doesn't return memories:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      curl -fsS -H "X-API-Key: $SEMVEC_LICENSE_KEY" \
        -X POST http://127.0.0.1:8080/v1/run \
        -H 'Content-Type: application/json' \
        -d '{"message":"readiness probe"}' \
        | grep -q session_id
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 5

Pre-warm in the entrypoint — make the embedder load synchronously before uvicorn starts taking traffic:

.venv/bin/python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
.venv/bin/semvec serve --host 0.0.0.0 --port 8080

Combine both: the entrypoint pre-warm makes the first /v1/run fast, the readiness probe protects against silent embedder failures (out-of-disk on model download, sidecar 503, etc.).

Prometheus label cardinality and tenant leaks

/metrics exports three series from a private CollectorRegistry:

Series Type Labels
semvec_requests_total Counter method, endpoint, status
semvec_request_duration_seconds Histogram method, endpoint
semvec_active_sessions Gauge

The endpoint label is the route template (e.g. /v1/session/{session_id}), not the resolved URL — a deliberate fix to keep cardinality bounded. UUID session/cluster/region IDs never reach the label.

No tenant-scoped series ship. license_subject is not a Prometheus label in 0.6.0. The Counter and Histogram aggregate across all callers. Implication:

  • A noisy tenant cannot be isolated from /metrics alone — you need application logs (which do carry session_id) plus log-based aggregation.
  • Conversely, you do not need to strip tenant data from /metrics before exposing it to a shared monitoring stack.

If you add custom Counters/Histograms to a fork or downstream deployment, do not add session_id, cluster_id, region_id, memory_hash, or license_subject as labels. Each is unbounded in cardinality (UUIDs, BLAKE3 hashes) and will blow up the Prometheus backend. Use exemplars or trace-IDs for per-request drilldown instead.

OpenTelemetry hooks

Semvec 0.6.0 does not emit OpenTelemetry traces or metrics natively. If you need distributed tracing across your agent → /v1/runembedder sidecar → database hops, instrument at the FastAPI layer with opentelemetry-instrumentation-fastapi:

pip install opentelemetry-instrumentation-fastapi \
            opentelemetry-instrumentation-sqlalchemy \
            opentelemetry-exporter-otlp

Install

OpenTelemetry instrumentation is opt-in: pip install opentelemetry-instrumentation-fastapi opentelemetry-sdk opentelemetry-exporter-otlp. semvec does not bundle OTel.

# wrapper around semvec.api:create_app
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from semvec.api import create_app

app = create_app()
FastAPIInstrumentor.instrument_app(app)

This captures HTTP-level spans (method, route, status, duration). It does not see inside the Rust core — SemvecState.update() is a single PyO3 call, opaque to Python-level tracing. The same is true for sidecar embedder calls if you do not separately instrument the embedder process.

Connection draining for the embedder sidecar

When SEMVEC_EMBEDDER_URL is set, the API process owns a SidecarEmbedderClient (HTTP keep-alive pool) instead of an in-process SentenceTransformer. Shutdown order matters:

  1. Drain the API first (SIGTERM semvec serve workers). The SessionManager.shutdown() lifespan hook closes the embedder client inside the 5 s drain window.
  2. Then drain the sidecar. SIGTERM python -m semvec.embedder. It has its own batch-flush logic; the daemon will reject new HTTP submissions and finish in-flight batches before exiting.

Reversing the order (sidecar first) causes the API workers to surface ConnectionRefusedError on every in-flight /v1/run until they finish draining. Always shut the consumer down before the producer.

If you run sidecar and API as separate Deployments behind a Service, add a preStop sleep on the sidecar Deployment that is longer than terminationGracePeriodSeconds on the API Deployment, so the sidecar outlives any API pod that's still draining.

State persistence & durability

The "Graceful shutdown" section above describes the 0.6.0 baseline: in-memory state, lost on clear(), restored only by an explicit export/import cadence you wire yourself. From 0.7.1 the server can do that for you — durable per-session state behind one environment variable.

SEMVEC_STATE_PERSIST — durable per-session state (default OFF)

Set SEMVEC_STATE_PERSIST=1 and point DATABASE_URL at Postgres to make each session's semantic state survive a worker restart. The mechanics:

  • Write-behind, not write-through. A flush is not taken on every turn. The server flushes a session's state on a periodic tick and again on SIGTERM (inside the graceful-drain window). This keeps the hot path off the database — turns stay in memory and only the flush pays the write cost.
  • Lazy reload. State is read back from the store on the first access to a session after a restart, not eagerly on boot. A fresh worker starts cold and hydrates a session the moment a request for it arrives.
  • Recovery point (RPO). On a graceful shutdown (SIGTERM) the final flush fires, so a session resumes bit-exact. On a hard crash (OOM kill, SIGKILL, node loss) you can lose up to one flush interval of turns — whatever changed since the last tick. Size the flush interval to the loss you can tolerate.

State blobs are stored as compressed binary (~56% smaller than the legacy JSON encoding). The reader is backward-compatible: legacy uncompressed JSON blobs written by older deployments are still read transparently, so an in-place upgrade needs no migration step.

One owner worker per session

Persistence assumes a single writer per session. There is no distributed lock — instead, each flush carries an optimistic version, and the store refuses a write whose version is stale (a compare-and-set that would clobber a newer blob). That refusal is loud (logged, surfaced), not a silent last-writer-wins.

The practical consequence: route every request for a given session_id to the same worker (sticky routing — consistent hashing at the load balancer, or a session-affinity cookie). If two workers both own the same live session and both flush, the CAS rejects the loser rather than corrupting the blob, but you will see churn and lost writes on the rejected side. Sticky routing avoids the contention entirely. This is the same single-writer rule as Thread safety of SemvecState.update(), extended across worker processes.

SEMVEC_STATE_DB_SHARDS — Postgres sharding (opt-in, advanced scaling)

For deployments where a single Postgres primary is the write-throughput ceiling, SEMVEC_STATE_DB_SHARDS takes a comma-separated list of Postgres DSNs and spreads the state-blob table across them:

export SEMVEC_STATE_PERSIST=1
export SEMVEC_STATE_DB_SHARDS="postgresql://u:p@shard-a/semvec,postgresql://u:p@shard-b/semvec,postgresql://u:p@shard-c/semvec"
  • Sharded by session_id via rendezvous (HRW) hashing. Each session maps to exactly one shard. Rendezvous hashing is chosen for its minimal reshuffle property: adding a shard moves only ~1/N of the keys (vs. a modulo scheme that remaps almost everything), so you can grow the pool with bounded rebalancing.
  • Metadata stays on the primary. Only the high-volume state-blob table is sharded. The session / cluster / region / audit metadata tables continue to live on the primary DATABASE_URL, so ownership checks and the compliance surface are unaffected.

This is a write-throughput lever, not a write-amplification fix. Sharding spreads writes across more backends; it does not reduce how much each turn writes. The orthogonal lever for that is blob compression (above), which shrinks each write ~56%. Use both together at scale. (Ships in 0.7.1.)

Persisting state in-process / library use

Automatic persistence is a server convenience — it lives in the FastAPI lifespan. A library user driving SemvecState / SemvecSession directly (no FastAPI) persists manually, and the primitive is the same compressed blob the server uses:

# --- on shutdown / checkpoint: serialize and store anywhere ---
blob = state.to_bytes(compress=True)   # compact binary, ~56% smaller
# persist `blob` wherever you like: a file, your own DB column, an
# object store (S3/GCS), etc. — it's opaque bytes.
Path("session-42.semvec").write_bytes(blob)

# --- on startup: load it back ---
blob = Path("session-42.semvec").read_bytes()
state = SemvecState.from_bytes(blob)   # auto-detects compressed vs legacy

from_bytes() auto-detects the encoding, so a blob written by an older build (uncompressed) still loads. See to_bytes / from_bytes in the Core API reference.

Durable cross-frontend dedup in-process is the same recipe plus one shared session: hold a single SemvecSession, feed every frontend's turns through it, and save/load it with to_bytes/from_bytes on the boundaries. The matched_id in each dedup_signal is a durable handle — it survives the to_bytes/from_bytes round-trip, so a correlation stored before the restart still resolves afterward. See Cross-frontend dedup → in-process.