Production hardening¶

Thread safety of `SemvecState.update()`¶

SemvecState is a Rust extension type exposed via PyO3. The wheel ships a single _core.abi3.so and the Python layer (SessionManager) wraps each session in a _Session dataclass with no per-instance Python-level lock.

The collection of sessions is protected by a threading.Lock inside SessionManager (lookup, insert, evict). That lock is released before the caller ever touches SemvecState.update(). Concurrent update() calls on the same SemvecState instance therefore depend on whatever locking the Rust core does internally, which is not part of the public contract in 0.6.0.

Operating rule:

Treat each SemvecState instance as single-writer. Serialize update() calls per session at the application layer.

Recommended patterns:

# asyncio app
import asyncio
session_locks: dict[str, asyncio.Lock] = {}

async def safe_update(session_id, state, vec, text):
    lock = session_locks.setdefault(session_id, asyncio.Lock())
    async with lock:
        return state.update(vec, text)

# threaded app — pin one state per thread, or wrap in a queue
from concurrent.futures import ThreadPoolExecutor
pool = ThreadPoolExecutor(max_workers=1)  # one worker per session

The REST layer already enforces this for you: /v1/run and /v1/store route every request for session_id=X through the same in-process _Session, and FastAPI's per-request handler keeps the call serialized within a single worker. Across multiple uvicorn/granian workers you need sticky routing — see the REST reference on SessionManager not being shared across worker processes.

update_batch() is the recommended path for bulk ingest: it accepts a list of (embedding, text) pairs and lets the Rust core amortize the state mutations. Do not parallelize multiple update_batch() calls on the same state.

Graceful shutdown¶

SessionManager.shutdown(timeout=5.0) is wired into the FastAPI lifespan and fires on SIGTERM (uvicorn / granian both honor it). The sequence on SIGTERM is:

Stop accepting new connections. uvicorn closes the listening socket.
Drain in-flight requests. Active /v1/run, /v1/store and other handlers finish their current call. uvicorn's default timeout_graceful_shutdown is unbounded — set it explicitly (--timeout-graceful-shutdown 30) to bound the wait.
Cancel the session sweeper (semvec-session-sweeper asyncio task), await its exit.
SessionManager.shutdown(timeout=5.0): drain the shared embedder (5 s budget — beyond that the embedder is dropped without waiting), then self._sessions.clear().
Process exit.

What is lost at step 4:

In-memory SemvecState for every session that wasn't snapshotted. clear() drops the Rust handles. SQLAlchemy session/cluster/member metadata stays in the database; the hot semantic state does not. Restoring after restart requires POST /v1/session/{id}/import with a snapshot taken via GET /v1/session/{id}/export while the process was still up.
Embedder batches in flight that did not complete within 5 s. The sidecar/batched embedder's pending Futures resolve with a shutdown error; the calling handler propagates a 5xx to the client.

What is preserved:

Compliance pack event-store rows (Postgres / SQLite — already fsynced per POST /v1/store).
Audit chain (HMAC / RS256 entries written synchronously).
Every SQLAlchemy row for sessions, clusters, regions, observers, members, audit events.

Operational rule: if your application cannot tolerate snapshot loss, issue GET /v1/session/{id}/export on a checkpoint cadence (every N turns or every M minutes) and store the result alongside your own application state. The 0.6.0 server does not auto-snapshot.

Systemd example:

[Service]
ExecStart=/opt/semvec/.venv/bin/semvec serve --host 127.0.0.1 --port 8080
KillSignal=SIGTERM
TimeoutStopSec=60s
Restart=on-failure

Kubernetes example:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: semvec
    lifecycle:
      preStop:
        exec:
          # Stop receiving new traffic, then let SIGTERM drain in-flight.
          command: ["sleep", "10"]

The preStop sleep 10 is a deliberate hack — it lets the service-mesh / Endpoints controller drop the pod from rotation before SIGTERM fires, so no new connection lands on a pod that's already draining.

Liveness vs readiness¶

0.6.0 ships one health endpoint: GET /v1/health (no auth, returns status, active_sessions, version). It is suitable as a liveness probe — it answers if the asyncio loop is alive and the SessionManager is reachable.

It is not a readiness probe. It does not check:

Embedder is loaded (first call lazy-loads it; a fresh worker will return 200 on /v1/health while the next /v1/run blocks for 10–30 s on model download / GPU warm-up).
Database connection is healthy.
Sidecar embedder (SEMVEC_EMBEDDER_URL) is reachable.
License-verification keyset is loaded.

Workarounds until a dedicated readiness endpoint ships:

K8s exec probe — call the API once during readiness, fail if it doesn't return memories:

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      curl -fsS -H "X-API-Key: $SEMVEC_LICENSE_KEY" \
        -X POST http://127.0.0.1:8080/v1/run \
        -H 'Content-Type: application/json' \
        -d '{"message":"readiness probe"}' \
        | grep -q session_id
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /v1/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 5

Pre-warm in the entrypoint — make the embedder load synchronously before uvicorn starts taking traffic:

.venv/bin/python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
.venv/bin/semvec serve --host 0.0.0.0 --port 8080

Combine both: the entrypoint pre-warm makes the first /v1/run fast, the readiness probe protects against silent embedder failures (out-of-disk on model download, sidecar 503, etc.).

Prometheus label cardinality and tenant leaks¶

/metrics exports three series from a private CollectorRegistry:

Series	Type	Labels
`semvec_requests_total`	Counter	`method`, `endpoint`, `status`
`semvec_request_duration_seconds`	Histogram	`method`, `endpoint`
`semvec_active_sessions`	Gauge	—

The endpoint label is the route template (e.g. /v1/session/{session_id}), not the resolved URL — a deliberate fix to keep cardinality bounded. UUID session/cluster/region IDs never reach the label.

No tenant-scoped series ship. license_subject is not a Prometheus label in 0.6.0. The Counter and Histogram aggregate across all callers. Implication:

A noisy tenant cannot be isolated from /metrics alone — you need application logs (which do carry session_id) plus log-based aggregation.
Conversely, you do not need to strip tenant data from /metrics before exposing it to a shared monitoring stack.

If you add custom Counters/Histograms to a fork or downstream deployment, do not add session_id, cluster_id, region_id, memory_hash, or license_subject as labels. Each is unbounded in cardinality (UUIDs, BLAKE3 hashes) and will blow up the Prometheus backend. Use exemplars or trace-IDs for per-request drilldown instead.

OpenTelemetry hooks¶

Semvec 0.6.0 does not emit OpenTelemetry traces or metrics natively. If you need distributed tracing across your agent → /v1/run → embedder sidecar → database hops, instrument at the FastAPI layer with opentelemetry-instrumentation-fastapi:

pip install opentelemetry-instrumentation-fastapi \
            opentelemetry-instrumentation-sqlalchemy \
            opentelemetry-exporter-otlp

Install

OpenTelemetry instrumentation is opt-in: pip install opentelemetry-instrumentation-fastapi opentelemetry-sdk opentelemetry-exporter-otlp. semvec does not bundle OTel.

# wrapper around semvec.api:create_app
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from semvec.api import create_app

app = create_app()
FastAPIInstrumentor.instrument_app(app)

This captures HTTP-level spans (method, route, status, duration). It does not see inside the Rust core — SemvecState.update() is a single PyO3 call, opaque to Python-level tracing. The same is true for sidecar embedder calls if you do not separately instrument the embedder process.

Connection draining for the embedder sidecar¶

When SEMVEC_EMBEDDER_URL is set, the API process owns a SidecarEmbedderClient (HTTP keep-alive pool) instead of an in-process SentenceTransformer. Shutdown order matters:

Drain the API first (SIGTERM semvec serve workers). The SessionManager.shutdown() lifespan hook closes the embedder client inside the 5 s drain window.
Then drain the sidecar. SIGTERM python -m semvec.embedder. It has its own batch-flush logic; the daemon will reject new HTTP submissions and finish in-flight batches before exiting.

Reversing the order (sidecar first) causes the API workers to surface ConnectionRefusedError on every in-flight /v1/run until they finish draining. Always shut the consumer down before the producer.

If you run sidecar and API as separate Deployments behind a Service, add a preStop sleep on the sidecar Deployment that is longer than terminationGracePeriodSeconds on the API Deployment, so the sidecar outlives any API pod that's still draining.