CLI (semvec)
The semvec command ships with the [api] extra and wraps
uvicorn to run the REST API. Install it
via:
pip install "semvec[api]"
Commands
semvec serve
Start the Semvec REST API server.
semvec serve [--host HOST] [--port PORT] [--workers N] \
[--embedder URL] [--embedder-mode inline|sidecar] \
[--reload] [--log-level LEVEL]
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --host | str | 0.0.0.0 | Bind address. Use 127.0.0.1 to restrict to localhost. |
| --port | int | 8080 | TCP port. |
| --workers | int | 1 | Number of uvicorn worker processes. With >1 the model would normally load N times; combine with --embedder or --embedder-mode sidecar to share a single embedder across workers. Incompatible with --reload. |
| --embedder | str | unset | Sidecar URL (unix:///abs/path.sock or tcp://host:port). Workers inject a SidecarEmbedderClient instead of loading the model in-process. The daemon must already be running (python -m semvec.embedder --listen ...). |
| --embedder-mode | inline / sidecar | inline | inline: each worker loads its own model. sidecar: semvec serve spawns one embedder daemon, waits for READY, then starts the API workers and points them at it via UDS. Best for multi-worker deployments. |
| --reload | bool flag | off | Enable uvicorn auto-reload on source change. Development only. |
| --log-level | critical / error / warning / info / debug | info | Log level for both uvicorn and the semvec application. |
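The flag constraints in the table (for example, --workers above 1 being incompatible with --reload) can be checked before the server starts. The following is a minimal sketch using argparse; `build_parser` and `check_flags` are hypothetical names for illustration and are not taken from the semvec source.

```python
import argparse
from typing import List

LOG_LEVELS = ["critical", "error", "warning", "info", "debug"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `semvec serve` flags."""
    parser = argparse.ArgumentParser(prog="semvec serve")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8080)
    parser.add_argument("--workers", type=int, default=1)
    parser.add_argument("--embedder", default=None)
    parser.add_argument("--embedder-mode", choices=["inline", "sidecar"], default="inline")
    parser.add_argument("--reload", action="store_true")
    parser.add_argument("--log-level", choices=LOG_LEVELS, default="info")
    return parser


def check_flags(args: argparse.Namespace) -> List[str]:
    """Return notes about a flag combination; an empty list means nothing to report."""
    notes = []
    if args.workers > 1 and args.reload:
        notes.append("--workers > 1 is incompatible with --reload")
    if args.workers > 1 and args.embedder is None and args.embedder_mode == "inline":
        notes.append("every worker will load its own model; consider a sidecar")
    return notes
```

Calling `check_flags(build_parser().parse_args(["--workers", "4", "--reload"]))` reports both notes, while adding `--embedder-mode sidecar` and dropping `--reload` reports none.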
The server loads semvec.api:create_app via uvicorn's --factory
mode, so every process creates its own SessionManager,
ClusterManager, etc. (state is in-memory and therefore per-worker —
see the REST API for the SQLite metadata schema used for
cross-worker persistence).
python -m semvec.embedder (sidecar daemon)
Stand-alone embedder daemon. The API workers connect to it over UDS
or TCP. Use this when you want to scale API workers independently of
the embedder, or run the embedder on a different host / GPU.
python -m semvec.embedder --listen unix:///run/semvec/embedder.sock \
--model all-MiniLM-L6-v2
| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --listen | str | required | unix:///abs/path.sock (Linux/macOS) or tcp://host:port (Windows-friendly). |
| --model | str | all-MiniLM-L6-v2 | Any sentence-transformers model name. |
| --dimension | int | 384 | Output dimension; must match the model. |
| --batch-max | int | 32 | Max texts coalesced per encode call. |
| --batch-wait-ms | float | 5.0 | Max wait (ms) for a batch to fill. Lower = lower latency, higher = better GPU utilisation. |
| --ready-fd | int | unset | Inheritable fd to write READY\n on once the listener is accepting. Used by semvec serve --embedder-mode sidecar for the parent/child handshake. |
| --log-level | critical / error / warning / info / debug | info | Daemon log level. |
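The --ready-fd handshake follows a common POSIX pattern: the parent creates a pipe, passes the write end to the child, and blocks on the read end until READY\n arrives. The sketch below illustrates that pattern with a stand-in child process; it does not launch the real daemon, `spawn_and_wait_ready` is a hypothetical helper, and how semvec serve implements its side is not shown in this document.

```python
import os
import subprocess
import sys

# Stand-in for the daemon: writes READY\n to the fd it is given, then exits.
# A real daemon would keep serving; this child exists only for the demo.
CHILD_CODE = (
    "import os, sys\n"
    "fd = int(sys.argv[1])\n"
    "os.write(fd, b'READY\\n')\n"
    "os.close(fd)\n"
)


def spawn_and_wait_ready() -> bytes:
    """Spawn a child with an inherited pipe fd and block until it reports READY."""
    read_fd, write_fd = os.pipe()
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(write_fd)],
        pass_fds=(write_fd,),  # keep the write end open in the child (POSIX only)
    )
    os.close(write_fd)  # the parent keeps only the read end
    with os.fdopen(read_fd, "rb") as reader:
        line = reader.readline()  # b"" means the child exited before writing
    child.wait()
    return line
```

Because the parent closes its own copy of the write end, an empty read signals that the child died before becoming ready, so the parent never blocks forever.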
The daemon installs SIGTERM / SIGINT handlers that drain in-flight
batches before exit. Clients that lose the connection during drain
receive a clean error and can reconnect once the daemon restarts.
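The interplay between --batch-max and --batch-wait-ms can be pictured as a small coalescing loop: wait for the first request, then keep collecting until the batch is full or the wait budget runs out. This is a minimal asyncio sketch of that idea; `collect_batch` is a hypothetical name and the daemon's actual batching code may differ.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, batch_max: int, batch_wait_ms: float) -> list:
    """Wait for one item, then coalesce more until the batch is full or the wait expires."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_wait_ms / 1000.0
    while len(batch) < batch_max:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # wait budget spent; ship what we have
    return batch


async def demo() -> list:
    """Queue five texts and pull two batches with batch_max=3."""
    queue: asyncio.Queue = asyncio.Queue()
    for text in ["a", "b", "c", "d", "e"]:
        queue.put_nowait(text)
    first = await collect_batch(queue, batch_max=3, batch_wait_ms=50.0)
    second = await collect_batch(queue, batch_max=3, batch_wait_ms=50.0)
    return [first, second]
```

In the demo the first batch fills immediately, while the second returns only two items after the wait expires, which is the latency versus utilisation trade-off the flag description refers to.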
Environment variables read at start-up
All variables below are read once per worker at process start. Change a value, then restart semvec serve for it to take effect. Grouped by concern.
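Reading configuration once at process start usually means taking a snapshot of the environment into an immutable object. The sketch below does this for a handful of the variables documented in this section; `StartupSettings` and `_env_int` are hypothetical names, and the defaults are copied from the tables on this page.

```python
import os
from dataclasses import dataclass


def _env_int(environ, name: str, default: int) -> int:
    """Parse an integer variable, falling back to the default when unset or blank."""
    raw = environ.get(name, "").strip()
    return int(raw) if raw else default


@dataclass(frozen=True)
class StartupSettings:
    """Hypothetical snapshot of a few documented variables, taken once at start-up."""

    database_url: str
    cors_origins: tuple
    max_sessions: int
    session_idle_ttl_s: int

    @classmethod
    def from_env(cls, environ=os.environ) -> "StartupSettings":
        # CORS_ORIGINS is a comma-separated list; blanks are dropped.
        origins = tuple(
            o.strip() for o in environ.get("CORS_ORIGINS", "").split(",") if o.strip()
        )
        return cls(
            database_url=environ.get("DATABASE_URL", "sqlite:///semvec.db"),
            cors_origins=origins,
            max_sessions=_env_int(environ, "SEMVEC_MAX_SESSIONS", 10000),
            session_idle_ttl_s=_env_int(environ, "SEMVEC_SESSION_IDLE_TTL_S", 1800),
        )
```

Because the dataclass is frozen, later changes to the environment cannot leak into a running worker, which matches the restart requirement described above.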
Server & auth
| Variable | Default | Purpose |
| --- | --- | --- |
| DATABASE_URL | sqlite:///semvec.db | SQLAlchemy URL for the session / cluster / audit metadata store. |
| CORS_ORIGINS | empty (no cross-origin access) | Comma-separated list of allowed origins, e.g. https://app.example.com,http://localhost:5173. When unset, the CORS middleware is skipped entirely for a small per-request win. |
| SEMVEC_LICENSE_KEY | (none) | Ed25519-signed license JWT (Pro / Enterprise features). |
| SEMVEC_ALLOW_ANONYMOUS | unset | Set to 1 to bypass license verification. Development only: every request is treated as anonymous community-tier. |
| METRICS_USER / METRICS_PASSWORD | (none) | Basic Auth for the /metrics endpoint. Must both be set to enable the endpoint. |
Session lifecycle
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_MAX_SESSIONS | 10000 | Hard cap on concurrent sessions per worker. Oldest-touched sessions are evicted on overflow. |
| SEMVEC_SESSION_IDLE_TTL_S | 1800 (30 min) | Sessions untouched for this long are evicted by the background sweeper. Set to 0 to disable. |
| SEMVEC_SESSION_SWEEP_S | 60 | How often the background task scans for idle sessions. Set to 0 to disable the sweeper entirely (useful in tests). |
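The two eviction rules above (a hard cap that drops the oldest-touched session, plus an idle TTL enforced by a periodic sweep) can be modelled with an ordered dictionary. This is an illustrative sketch only; `SessionTable` is a hypothetical class and says nothing about how SessionManager is actually implemented.

```python
import time
from collections import OrderedDict


class SessionTable:
    """Hypothetical per-worker table: least-recently-touched cap plus idle-TTL sweep."""

    def __init__(self, max_sessions: int = 10000, idle_ttl_s: float = 1800, clock=time.monotonic):
        self.max_sessions = max_sessions
        self.idle_ttl_s = idle_ttl_s
        self._clock = clock
        self._last_touch = OrderedDict()  # session_id -> last-touched timestamp

    def touch(self, session_id: str) -> list:
        """Record activity; return the ids evicted because the cap was exceeded."""
        self._last_touch[session_id] = self._clock()
        self._last_touch.move_to_end(session_id)
        evicted = []
        while len(self._last_touch) > self.max_sessions:
            oldest, _ = self._last_touch.popitem(last=False)
            evicted.append(oldest)
        return evicted

    def sweep(self) -> list:
        """Evict sessions idle for idle_ttl_s or longer; a TTL of 0 disables the sweep."""
        if self.idle_ttl_s <= 0:
            return []
        cutoff = self._clock() - self.idle_ttl_s
        expired = [sid for sid, ts in self._last_touch.items() if ts <= cutoff]
        for sid in expired:
            del self._last_touch[sid]
        return expired
```

A background task would call `sweep()` every SEMVEC_SESSION_SWEEP_S seconds; injecting the clock keeps the class easy to test without sleeping.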
Embedder
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_EMBEDDER_URL | unset | Same effect as --embedder. When set, the lifespan injects a SidecarEmbedderClient instead of loading the model in-process. Read by every worker. |
| SEMVEC_EMBEDDER_MODEL | all-MiniLM-L6-v2 | Default sentence-transformers model name the sidecar daemon loads when --model is not provided. Override per-deployment. |
| SEMVEC_EMBEDDER_DIM | 384 | Output dimension expected from the sidecar; must match the model the daemon was launched with. |
| SEMVEC_EMBEDDER_CACHE_SIZE | 0 (disabled) | When >0, wraps the injected embedder in a CachedEmbedder with this LRU capacity. Cache hits skip the model; concurrent submits for the same text dedup onto one underlying encode. Cheapest path to ×2–×3 RPS on chat traffic. |
| SEMVEC_USE_RUST_EMBEDDER / SEMVEC_EMBEDDER_BIN | unset | Opt-in switches that make --embedder-mode sidecar spawn the Rust semvec-embedder binary instead of the Python daemon. See Embedders guide. |
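To make the effect of SEMVEC_EMBEDDER_CACHE_SIZE concrete, here is a minimal sketch of an LRU wrapper around an arbitrary text-to-vector callable. `LruCachedEmbedder` is a hypothetical stand-in, not the CachedEmbedder class itself, and it deliberately omits the de-duplication of concurrent submits mentioned in the table.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class LruCachedEmbedder:
    """Hypothetical LRU wrapper around any text -> vector callable."""

    def __init__(self, embed: Callable[[str], Sequence[float]], capacity: int):
        self._embed = embed
        self._capacity = capacity
        self._cache = OrderedDict()  # text -> vector, least recently used first
        self.hits = 0
        self.misses = 0

    def __call__(self, text: str):
        if self._capacity > 0 and text in self._cache:
            self._cache.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self._cache[text]
        self.misses += 1
        vector = self._embed(text)
        if self._capacity > 0:
            self._cache[text] = vector
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # drop the least recently used entry
        return vector
```

With a capacity of 0 the wrapper passes every call straight through, mirroring the documented "0 (disabled)" default.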
Retrieval (/v1/run)
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_RUN_TOP_K | 5 | Number of memories surfaced per /v1/run (used by the context block, short-circuit, and drift scoring). Raising it catches lexically-distant facts; lowering it keeps prompts tight. |
| SEMVEC_MMR_FETCH_K | 0 (disabled) | When > SEMVEC_RUN_TOP_K, fetch this many candidates and Maximal-Marginal-Relevance rerank down to SEMVEC_RUN_TOP_K. Demotes near-duplicate memories so diverse facts survive into the final set. 50–200 is a good starting range. |
| SEMVEC_MMR_LAMBDA | 0.5 | MMR relevance/diversity mix. 1.0 = pure cosine retrieval (no diversity), 0.0 = pure diversity (no relevance). |
| SEMVEC_CONTEXT_BUDGET_CHARS | 4000 | Total character budget for the context string returned by /v1/run, packed sum-as-you-go across retrieved memories. Replaces the legacy per-memory 150-char cap. Long memories use what they need; short ones don't waste budget. |
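The role of SEMVEC_MMR_LAMBDA is easiest to see in the textbook Maximal Marginal Relevance loop, where each pick maximises lambda times relevance minus (1 - lambda) times similarity to what has already been selected. The sketch below is a plain-Python rendering of that standard formulation; `mmr_rerank` is a hypothetical helper and semvec's own implementation may differ in detail.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query, candidates, top_k: int, lam: float = 0.5) -> List[int]:
    """Return indices of `candidates` chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < top_k:

        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For a query `[1.0, 0.0]` and candidates `[[1.0, 0.5], [1.0, 0.51], [1.0, -0.6]]`, a lambda of 1.0 keeps the two near-duplicates, while 0.5 swaps the second one for the more distinct third vector.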
BM25-hybrid retrieval (opt-in, needs semvec[hybrid])
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_HYBRID_BM25 | 0 (off) | Master switch. When 1, every session also maintains a per-session BM25 index and /v1/run fuses dense + lexical candidates via Reciprocal Rank Fusion. |
| SEMVEC_BM25_FETCH_K | 50 | BM25 top-K fed into the fusion. |
| SEMVEC_BM25_REBUILD_EVERY | 64 | Ingests between snapshot rebuilds of the per-session BM25 index. Lower = fresher BM25 at higher rebuild cost. |
| SEMVEC_RRF_K | 60 | RRF smoothing constant. The standard value from the RRF paper; rarely worth changing. |
| SEMVEC_RRF_WEIGHTS | unset (uniform) | Comma-separated per-list weights, e.g. "1.0,0.4" to halve the BM25 contribution. Useful when BM25 hurts single-fact precision. |
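Reciprocal Rank Fusion scores each item as the weighted sum of 1 / (k + rank) over the lists it appears in, which is where SEMVEC_RRF_K and SEMVEC_RRF_WEIGHTS enter. The following is a minimal sketch of that formula; `parse_weights` and `rrf_fuse` are hypothetical helpers, and the tie-break on item id is an assumption made here to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def parse_weights(raw: Optional[str], n_lists: int) -> List[float]:
    """Parse a "1.0,0.4"-style string; fall back to uniform weights when unset."""
    if not raw:
        return [1.0] * n_lists
    weights = [float(part) for part in raw.split(",")]
    if len(weights) != n_lists:
        raise ValueError(f"expected {n_lists} weights, got {len(weights)}")
    return weights


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60, weights=None) -> List[str]:
    """Fuse ranked id lists: score(d) = sum_i w_i / (k + rank_i(d)), ranks starting at 1."""
    weights = weights or [1.0] * len(ranked_lists)
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in zip(weights, ranked_lists):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties broken by id (an assumption of this sketch).
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With a dense list `["m1", "m2", "m3"]` and a BM25 list `["m3", "m4", "m5"]`, uniform weights promote `m3` to the top because both lists agree on it; shrinking the BM25 weight towards zero restores the dense order.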
Cross-encoder rerank (opt-in)
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_RERANK_MODEL | unset (off) | HuggingFace model ID, e.g. cross-encoder/ms-marco-MiniLM-L-6-v2. When set, /v1/run reranks the BM25 / dense fusion output through this cross-encoder before returning the final top-K. |
| SEMVEC_RERANK_FETCH_K | 50 | Candidate pool fed into the cross-encoder. |
| SEMVEC_RERANK_BATCH | 64 | Cross-encoder batch size. Tune against the GPU/CPU running the worker. |
| SEMVEC_RERANK_FP16 | 0 | Set 1 for FP16 inference on GPU, typically 1.5–2× faster with no observable quality loss. |
| SEMVEC_RERANK_THREADS | os.cpu_count() | Torch intra-op thread cap for CPU inference. Set lower if you co-locate the API with other CPU-heavy tasks. |

| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_TOPIC_SWITCH | 1 | Master switch for the topic-switch detector. 0 disables it, which is useful for parity tests that must hold the state still. |
| PSS_TOPIC_SWITCH | (none) | Deprecated legacy alias for SEMVEC_TOPIC_SWITCH read as a fallback by the session manager; scheduled for removal in 1.0. Prefer the SEMVEC_*-prefixed variable. |
| SEMVEC_AUTO_ANCHOR_ON_TOPIC_SWITCH | 0 | Set 1 to snapshot semantic_state as a fresh anchor every time the detector fires. Capped by SEMVEC_MAX_AUTO_ANCHORS. |
| SEMVEC_AUTO_ANCHOR_FROM_EXTRACT | 0 | Set 1 to also create anchors from extracted-entity embeddings (when auto-extract is on). |
| SEMVEC_MAX_AUTO_ANCHORS | 8 | Cap on the number of anchors created via either auto-anchor path. |
| SEMVEC_AUTO_EXTRACT | 0 | Set 1 to enable best-effort numeric / entity extraction from ingested text. |
| SEMVEC_AUTO_EXTRACT_BROAD | 0 | Broader extractor profile (more recall, more noise). Implies SEMVEC_AUTO_EXTRACT=1. |
| SEMVEC_ENABLE_NUMERIC_EXTRACTOR | 1 | Set 0 to disable the numeric extractor (IBAN, amounts, IDs), useful when downstream code does its own extraction. |
Compliance & event store (semvec[compliance])
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_ENABLE_EVENT_STORE | 0 | Set 1 to write every state mutation into the append-only event store. Required for deterministic replay and signed deletion certificates. |
| SEMVEC_ENABLE_HMAC_SIGNING | 0 | Set 1 to sign every event-store entry with HMAC for tamper-evidence. Requires a key configured in the compliance config. |
| SEMVEC_ENABLE_RS256_JWT | 0 | Set 1 to issue RS256-signed user JWTs from the compliance routes (vs HS256). Requires a private key. |
| SEMVEC_ENABLE_RETENTION_SWEEPER | 0 | Set 1 to run the background retention sweeper that deletes events older than the configured retention horizon. |
| SEMVEC_RETENTION_DAYS_AUDIT | 2557 (≈ 7 years) | Retention horizon for audit events. |
| SEMVEC_RETENTION_DAYS_CHAT | 365 | Retention horizon for chat events. |
| SEMVEC_COMPLIANCE_PUBKEY_FILE | unset | Path to the compliance verifier public key (PEM). Used to verify signed deletion certificates and RS256 JWTs. |
| SEMVEC_COMPLIANCE_PUBKEY_PEM | unset | Inline PEM alternative to …_FILE. |
| SEMVEC_COMPLIANCE_PRIVKEY_FILE | unset | Path to the compliance signing private key. Issuer side only; never set on verifying instances. |
| SEMVEC_COMPLIANCE_PRIVKEY_PEM | unset | Inline PEM alternative to …_FILE. |
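As a rough illustration of what HMAC signing of event-store entries buys, the sketch below signs a canonical JSON encoding of each event and chains it to the previous signature, so that editing or reordering entries invalidates every later signature. The chaining scheme, the JSON canonicalisation, and the names `sign_event` and `verify_event` are all assumptions made for this example; the actual on-disk format used by semvec[compliance] is not described here.

```python
import hashlib
import hmac
import json


def sign_event(key: bytes, event: dict, prev_signature: str = "") -> str:
    """Return a hex HMAC-SHA256 over the canonical event JSON chained to the previous signature."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    message = prev_signature.encode("ascii") + b"\n" + payload
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_event(key: bytes, event: dict, signature: str, prev_signature: str = "") -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_event(key, event, prev_signature)
    return hmac.compare_digest(expected, signature)
```

Sorting keys before serialising makes the signature independent of dictionary ordering, and `hmac.compare_digest` avoids leaking timing information during verification.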
Licensing internals
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_LICENSE_KEY | (none) | Ed25519-signed license JWT. Required for Pro / Enterprise features and quotas. |
| SEMVEC_LICENSE_LRU_SIZE | 256 | LRU cache size for verified JWTs. Higher = more memory, fewer signature verifies per second. |
API process
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_API_THREADPOOL | 200 | Size of the asyncio default executor thread pool. This cap bounds the amount of in-flight blocking work. |
| SEMVEC_STATE_DIR | .semvec | Default directory that CodingEngine and adjacent components use for persistent state. |
The API contract version matches semvec.__version__ from the
installed wheel (informational; not a runtime knob).
Build-time-only environment variables (wheel builders only)
Not consumed at runtime
The variables below are read only when building semvec from
source (wheel / sdist construction). Setting them on a running
semvec serve process has no effect — the installed wheel
already has the relevant values baked in.
| Variable | Default | Purpose |
| --- | --- | --- |
| SEMVEC_BASE_URL | unset | Public base URL baked into the built artefact for absolute-link generation. |
| SEMVEC_EMBEDDED_PUBKEY_PATH | build-time baked | Override the embedded verifier public-key path picked up by the build. |
| SEMVEC_PROD_PUBKEY_FILE | (none) | Path to a production public-key bundle. Used by the build to bake the correct verifier. |
| SEMVEC_PROD_PUBKEY_PEM | (none) | Inline PEM alternative to …_FILE. |
| SEMVEC_BUILD_ALLOW_DEV_KEY | 0 | Set 1 to allow the dev verifier in a release build (refused by default). |
| SEMVEC_COMPLIANCE_PUBKEY_TARGET | unset | Path the build rewrites with the latest pubkey when rotating from a hot key registry. |
Examples
Local development
export SEMVEC_ALLOW_ANONYMOUS=1
export DATABASE_URL="sqlite:///dev.db"
semvec serve --host 127.0.0.1 --port 8080 --reload --log-level debug
Production behind a reverse proxy
export DATABASE_URL="postgresql://semvec:pass@db/semvec"
export CORS_ORIGINS="https://app.example.com"
export METRICS_USER="prom"
export METRICS_PASSWORD="$(cat /run/secrets/metrics_password)"
semvec serve --host 0.0.0.0 --port 8080 --log-level info
Behind nginx / an ALB, the server trusts the X-Forwarded-For and
X-Real-IP headers for client-IP resolution (used by the rate
limiter and the audit log).
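Client-IP resolution from proxy headers typically looks like the sketch below. The precedence shown (leftmost X-Forwarded-For entry, then X-Real-IP, then the socket peer) is an assumption for illustration, since the exact order semvec applies is not stated here; `resolve_client_ip` is a hypothetical helper.

```python
from typing import Mapping


def resolve_client_ip(headers: Mapping[str, str], peer_ip: str) -> str:
    """Prefer X-Forwarded-For (leftmost entry), then X-Real-IP, then the socket peer."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    first_hop = forwarded.split(",")[0].strip()
    if first_hop:
        return first_hop
    real_ip = lowered.get("x-real-ip", "").strip()
    return real_ip or peer_ip
```

These headers are only as trustworthy as the proxy that sets them, so the proxy should overwrite any client-supplied values and the API port should not be reachable directly.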
Programmatic start (without the CLI)
python -m uvicorn semvec.api:create_app --factory --host 0.0.0.0 --port 8080
Same effect, handy when you want to wire the factory into a larger
ASGI app (e.g. mounted under a prefix).
See also