Compliance Pack — Enterprise / Regulated Industries
Install
# Library-only use (event store, extractors, certificates,
# HMAC signing — no FastAPI):
pip install "semvec[compliance]"
# When you want the FastAPI router + middleware too:
pip install "semvec[api,compliance]"
The [compliance] extra pulls in cryptography>=42 for the
DeletionCertificate signer and the RS256 user-JWT verifier. The
FastAPI router (compliance_router) and middleware
(ComplianceHmacMiddleware) live in semvec.api.* and need the
heavier [api] extra (FastAPI, slowapi, SQLAlchemy, prometheus).
The Compliance Pack adds the cryptographic verification, retention,
and selective-deletion layers that regulated tenants need on top of
the base SemvecState. Every feature ships behind a
SEMVEC_ENABLE_* environment variable, defaulting to off, so an
existing deployment that imports semvec does not pick up new
behaviour by accident.
What's in the pack
| Capability | Module | Why |
|---|---|---|
| Append-only event store | semvec.compliance.event_store | The 3-tier memory + EMA vector + literal cache become derived views, rebuildable from the events at any time. Single source of truth for what shaped the state. |
| Deterministic replay | semvec.compliance.event_replay | Two replays of the same event stream produce bit-identical semantic_states. Required for audit reconstruction and after-deletion rebuilds. |
| Automatic 30-day retention | semvec.compliance.retention | Cron-friendly sweeper that purges anything older than retention_days and writes an audit record per affected user. |
| GDPR Art. 17 forget | semvec.compliance.retention.forget_user | Synchronous wipe + signed DeletionCertificate the customer can verify offline. |
| Verbatim-precise facts | semvec.compliance.extractors | Regex-based numeric / date / identifier extractors. Decimal precision; never roundtrips through float. Includes IBAN mod-97 checksum. |
| HMAC request signing | semvec.compliance.hmac_signing + api.middleware.compliance_auth | AWS-SigV4-style canonical request (METHOD, PATH, SHA256(body), TS, NONCE), HMAC-SHA256, constant-time verify, replay defence. |
| RS256 user JWT | semvec.compliance.rs256 + key_registry | Per-user public key registered server-side, private key never leaves the client. The server cannot forge tokens. |
| Async vector rebuild | semvec.compliance.workers.vector_rebuild | Decouples the post-DELETE replay from the request path: the API endpoint enqueues, the worker rebuilds, the session store gets the new vector. |
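The mod-97 check named in the extractors row can be illustrated with the standard library alone. This is a minimal sketch of the generic ISO 7064 MOD 97-10 rule used for IBANs; the function name iban_checksum_ok is hypothetical and is not taken from semvec.compliance.extractors, whose internals are not shown here.

```python
def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN passes the mod-97 check (hypothetical helper)."""
    compact = iban.replace(" ", "").upper()
    # IBANs are 15 to 34 ASCII letters/digits; reject anything else early.
    if not 15 <= len(compact) <= 34:
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move country code and check digits (first four chars) to the end.
    rearranged = compact[4:] + compact[:4]
    # Map letters to numbers: A=10 ... Z=35. int(ch, 36) does exactly that.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # Integer arithmetic only; no float is involved at any point.
    return int(digits) % 97 == 1


print(iban_checksum_ok("DE89 3704 0044 0532 0130 00"))
print(iban_checksum_ok("DE89 3704 0044 0532 0130 01"))
```

The check catches every single-digit substitution, which is why it is useful as a filter before a string is recorded as an identifier fact.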
Quickstart
Wire an event-sourced state
from semvec import SemvecConfig
from semvec.compliance.event_store import SqliteEventStore
from semvec.compliance.state_proxy import ComplianceState
store = SqliteEventStore(path="events.sqlite")
store.init_schema()
state = ComplianceState(
    SemvecConfig(dimension=384),
    event_store=store,
    user_id="user-42",
    default_meta={"channel": "chat"},
)
# Every successful update appends a MemoryEvent. Failures (dim
# mismatch, isolation reject) propagate without writing.
state.update(my_embedder.get_embedding("Hello"), "Hello")
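To make the "derived view" idea concrete, here is a toy sketch of folding an event log into an exponential moving average and rebuilding it after a deletion. Everything in it is an assumption for illustration: the Event dataclass, the replay function, and the alpha=0.1 update rule are hypothetical and are not semvec's actual update equation.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:  # hypothetical stand-in for a stored memory event
    user_id: str
    embedding: tuple


def replay(events, dimension, alpha=0.1):
    """Fold events, in order, into an EMA vector (illustrative rule only)."""
    state = [0.0] * dimension
    for event in events:
        if len(event.embedding) != dimension:
            raise ValueError("dimension mismatch")
        state = [(1.0 - alpha) * s + alpha * x
                 for s, x in zip(state, event.embedding)]
    return state


def state_bytes(state):
    """Serialise as raw IEEE-754 doubles so equality means bit-identical."""
    return struct.pack(f"<{len(state)}d", *state)


log = [
    Event("user-42", (1.0, 0.0, 0.5)),
    Event("user-7", (0.0, 1.0, 0.5)),
    Event("user-42", (0.25, 0.25, 0.25)),
]

# Same events, same order, same arithmetic: the two replays match bit for bit.
first = replay(log, dimension=3)
second = replay(log, dimension=3)
assert state_bytes(first) == state_bytes(second)

# Forgetting a user means deleting their events, then rebuilding the view.
remaining = [e for e in log if e.user_id != "user-42"]
rebuilt = replay(remaining, dimension=3)
```

The rebuilt vector carries no contribution from the deleted events because it was never derived from them, which is the property a reset of the derived state alone cannot give.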
Extract verbatim facts
from semvec.compliance.extractors import extract_facts
text = "Mein Kontostand ist 1.247,38 € am 15.08.2026"
for fact in extract_facts(text):
    print(fact.kind, fact)
# numeric NumericFact(value=Decimal('1247.38'), unit='EUR', ...)
# date DateFact(value=datetime(2026, 8, 15, tzinfo=UTC), ...)
Decimal precision is enforced — Decimal('0.1') + Decimal('0.2') == Decimal('0.3') exactly. Float roundtrips are forbidden.
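The difference between Decimal and float handling can be checked with the standard library. The sketch below parses the German-formatted amount from the example above without touching float; parse_de_amount is a hypothetical helper written for this illustration, not the extractor's real implementation.

```python
from decimal import Decimal


def parse_de_amount(text: str) -> Decimal:
    """Parse a German-formatted amount such as '1.247,38' into a Decimal."""
    # In this locale '.' groups thousands and ',' marks the decimal point.
    normalised = text.strip().replace(".", "").replace(",", ".")
    return Decimal(normalised)  # string -> Decimal, no float in between


amount = parse_de_amount("1.247,38")
assert amount == Decimal("1247.38")

# Decimal arithmetic is exact for decimal fractions ...
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
# ... whereas binary floats are not, and Decimal(float) preserves the error.
assert 0.1 + 0.2 != 0.3
assert Decimal(0.1) != Decimal("0.1")
```

The last assertion is the reason a float roundtrip is dangerous: once a value has passed through float, converting it to Decimal faithfully records the binary rounding error.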
Run the retention sweeper
from semvec.compliance.retention import RetentionSweeper
report = RetentionSweeper(store=store).sweep(retention_days=30)
print(report.deleted_total, report.deleted_per_user)
Idempotent — a second call with the same retention window is a no-op.
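Why a repeated sweep is a no-op is easiest to see on a toy table. The following sketch uses sqlite3 directly with an assumed two-column schema (user_id, created_at as epoch seconds); the schema and the sweep function are hypothetical and do not describe how SqliteEventStore or RetentionSweeper are built.

```python
import sqlite3
from collections import Counter
from datetime import datetime, timedelta, timezone


def sweep(conn, retention_days, now):
    """Delete events older than the cutoff; return per-user deletion counts."""
    cutoff = (now - timedelta(days=retention_days)).timestamp()
    with conn:  # commits on success, rolls back if the DELETE raises
        rows = conn.execute(
            "SELECT user_id, COUNT(*) FROM events "
            "WHERE created_at < ? GROUP BY user_id",
            (cutoff,),
        ).fetchall()
        conn.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
    return Counter(dict(rows))


now = datetime(2026, 8, 15, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, created_at REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("user-42", (now - timedelta(days=45)).timestamp()),
        ("user-42", (now - timedelta(days=31)).timestamp()),
        ("user-7", (now - timedelta(days=40)).timestamp()),
        ("user-7", (now - timedelta(days=1)).timestamp()),
    ],
)

first = sweep(conn, retention_days=30, now=now)
second = sweep(conn, retention_days=30, now=now)  # nothing left to delete
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The second call finds no rows older than the cutoff, so it deletes nothing and reports an empty count; only the one event inside the window survives.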
Issue a signed DeletionCertificate (GDPR Art. 17)
from semvec.compliance.audit import InMemoryAuditLog
from semvec.compliance.retention import forget_user
cert = forget_user(
    user_id="user-42",
    store=store,
    audit_log=InMemoryAuditLog(),
    issuer="versino-compliance",
)
# Customer-side verification (offline):
from semvec.compliance.certificates import verify_certificate
assert verify_certificate(cert) # uses the wheel-embedded pubkey
The certificate's reason field is server-controlled. The POST /v1/compliance/users/{uid}/forget HTTP endpoint always writes reason="user_request" into the signed payload, even if the request body carries a different value (e.g. "reason":"user_request_dsgvo_art17"). This is intentional: the signed certificate is an attestation issued by the operator, so an arbitrary user-supplied string in there would dilute its evidentiary value. Use the forget_user() Python API directly if you need a custom reason (e.g. ttl_expired from a sweeper).
The wheel ships with the operator's RSA-3072 public key embedded at
build time (set the SEMVEC_COMPLIANCE_PUBKEY_PEM repository
secret in CI). Customers can verify the certificate without any
configuration. Operators on a self-managed deployment override the
key via SEMVEC_COMPLIANCE_PUBKEY_FILE or SEMVEC_COMPLIANCE_PUBKEY_PEM.
Sign HTTP requests against the server
from semvec.compliance.hmac_signing import sign_request
from datetime import UTC, datetime
import secrets
body = b'{"reason":"user_request"}'
ts = datetime.now(UTC).isoformat()
nonce = secrets.token_hex(16)
signature = sign_request(
    secret=my_hmac_secret,
    method="POST",
    path="/v1/compliance/users/user-42/forget",
    body=body,
    timestamp=ts,
    nonce=nonce,
)
headers = {
    "X-Semvec-User-Id": "user-42",
    "X-Semvec-Key-Id": my_kid,
    "X-Semvec-Timestamp": ts,
    "X-Semvec-Nonce": nonce,
    "X-Semvec-Signature": signature,
}
Sign the path, not the URL. The middleware verifies against request.url.path only; the query string is not part of the canonical request. For GET /v1/compliance/users/user-42/facts?type=numeric the signing path is /v1/compliance/users/user-42/facts. Signing a path that has the query string baked in produces a 401 bad_signature.

Practical consequence: do not put tamper-relevant input in the query string (?action=delete-style toggles). Filters that only shape the response (?type=numeric) are fine: the worst a MitM can do is change the filter on a read-only request. A future release may include the canonical query string in the signed payload (AWS-SigV4 §3.2.4 style), which would be a breaking change to client signers; until then, keep query parameters limited to shaping read-only responses.
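A client can derive the signing path from a full URL with urllib.parse. The sketch below also shows why the query string cannot affect the signature. The canonical layout is an assumption: the five fields come from the capability table, but joining them with newlines is a guess for illustration, and sketch_sign is a hypothetical stand-in rather than the real sign_request.

```python
import hashlib
import hmac
from urllib.parse import urlsplit


def signing_path(url: str) -> str:
    """Return only the path component; the query string is never signed."""
    return urlsplit(url).path


def sketch_sign(secret, method, url, body, timestamp, nonce):
    # ASSUMPTION: newline-joined fields; the real canonical layout may differ.
    canonical = "\n".join([
        method.upper(),
        signing_path(url),
        hashlib.sha256(body).hexdigest(),
        timestamp,
        nonce,
    ])
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


base = "https://api.example.com/v1/compliance/users/user-42/facts"
args = dict(secret=b"demo-secret", method="GET", body=b"",
            timestamp="2026-08-15T12:00:00+00:00", nonce="a1b2c3")

sig_numeric = sketch_sign(url=base + "?type=numeric", **args)
sig_date = sketch_sign(url=base + "?type=date", **args)

# Different query strings, identical signature: the filter is unauthenticated.
assert sig_numeric == sig_date
```

Because both URLs sign to the same value, anything placed in the query string can be swapped in transit without invalidating the request.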
Mount the FastAPI middleware
from fastapi import FastAPI
from semvec.api.compliance_routes import (
    compliance_router,
    set_compliance_dependencies,
)
from semvec.api.middleware.compliance_auth import ComplianceHmacMiddleware
from semvec.compliance.audit import InMemoryAuditLog
from semvec.compliance.event_store import SqliteEventStore
from semvec.compliance.key_registry import InMemoryKeyRegistry
from semvec.compliance.nonce_cache import InMemoryNonceCache
store = SqliteEventStore(path="events.sqlite")
store.init_schema()
registry = InMemoryKeyRegistry()
nonce_cache = InMemoryNonceCache(window_seconds=60)
app = FastAPI()
app.add_middleware(
    ComplianceHmacMiddleware,
    registry=registry,
    nonce_cache=nonce_cache,
    protected_prefix="/v1/compliance",
)
set_compliance_dependencies(store=store, audit_log=InMemoryAuditLog())
app.include_router(compliance_router)
Failure modes the middleware enforces:
- missing_signature: required X-Semvec-* headers absent.
- timestamp_out_of_window: clock skew exceeds the configured window.
- unknown_key: user/key pair not in the registry.
- user_id_mismatch: signed user-id does not match the path's user-id.
- bad_signature: HMAC verify failed.
- nonce_replayed: same nonce already observed in the window (HTTP 409).
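The timestamp_out_of_window and nonce_replayed checks can be pictured with a small in-memory model. This is a sketch under stated assumptions: WindowedNonceCache and timestamp_in_window are hypothetical names, the window is treated as symmetric around the server clock, and naive timestamps are rejected; none of this is confirmed behaviour of InMemoryNonceCache or the middleware.

```python
from datetime import datetime, timedelta, timezone


class WindowedNonceCache:
    """Remember nonces for window_seconds; reject repeats inside the window."""

    def __init__(self, window_seconds: int = 60):
        self.window = timedelta(seconds=window_seconds)
        self._seen = {}

    def check_and_store(self, nonce: str, now: datetime) -> bool:
        # Drop entries that have aged out of the window.
        self._seen = {n: t for n, t in self._seen.items()
                      if now - t < self.window}
        if nonce in self._seen:
            return False  # replay
        self._seen[nonce] = now
        return True


def timestamp_in_window(ts: str, now: datetime, window_seconds: int = 60) -> bool:
    """Accept ISO-8601 timestamps within +/- window_seconds of now."""
    try:
        sent = datetime.fromisoformat(ts)
    except ValueError:
        return False
    if sent.tzinfo is None:
        return False  # naive timestamps are ambiguous, so reject them
    return abs(now - sent) <= timedelta(seconds=window_seconds)


now = datetime(2026, 8, 15, 12, 0, tzinfo=timezone.utc)
cache = WindowedNonceCache(window_seconds=60)
fresh = cache.check_and_store("n-1", now)
replayed = cache.check_and_store("n-1", now + timedelta(seconds=5))
after_expiry = cache.check_and_store("n-1", now + timedelta(seconds=61))
```

The nonce only needs to be remembered for as long as the timestamp check would still accept the request, which is why both checks can share one window length.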
Runtime configuration
# Feature flags — every one defaults to off.
export SEMVEC_ENABLE_EVENT_STORE=1
export SEMVEC_ENABLE_RETENTION_SWEEPER=1
export SEMVEC_ENABLE_HMAC_SIGNING=1
export SEMVEC_ENABLE_RS256_JWT=1
export SEMVEC_ENABLE_NUMERIC_EXTRACTOR=1
# Retention windows.
export SEMVEC_RETENTION_DAYS_CHAT=30 # default 30
export SEMVEC_RETENTION_DAYS_AUDIT=2555 # default ~ 7 years
# DeletionCertificate keys.
export SEMVEC_COMPLIANCE_PRIVKEY_FILE=/path/to/compliance.priv.pem
# (Operators only; the matching public key is embedded in the wheel.)
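For application code that needs to honour the same flags, a default-off reader might look like the sketch below. The helper names flag_enabled and retention_days are hypothetical, and the set of accepted truthy spellings is an assumption; the configuration above only shows the value 1.

```python
import os

# ASSUMPTION: which spellings count as "on"; only "1" appears in the docs.
_TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, environ=None) -> bool:
    """Treat an unset or unrecognised value as off."""
    env = os.environ if environ is None else environ
    return env.get(name, "").strip().lower() in _TRUTHY


def retention_days(name: str, default: int, environ=None) -> int:
    """Read a positive day count, falling back to the default when unset."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    days = int(raw)
    if days <= 0:
        raise ValueError(f"{name} must be a positive integer, got {raw!r}")
    return days


demo_env = {"SEMVEC_ENABLE_EVENT_STORE": "1", "SEMVEC_RETENTION_DAYS_AUDIT": "2555"}
event_store_on = flag_enabled("SEMVEC_ENABLE_EVENT_STORE", demo_env)
sweeper_on = flag_enabled("SEMVEC_ENABLE_RETENTION_SWEEPER", demo_env)  # unset
chat_days = retention_days("SEMVEC_RETENTION_DAYS_CHAT", 30, demo_env)
audit_days = retention_days("SEMVEC_RETENTION_DAYS_AUDIT", 2555, demo_env)
```

Passing the environment as a mapping keeps the helpers testable without mutating os.environ.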
Architecture notes
- Event store is authoritative; everything else is derived. A reset of the EMA vector or the 3-tier memory does not lose information, because replay rebuilds them. A delete in the event store is the only way to genuinely forget something.
- Replay never trips the rate limiter. The replay path uses the internal _internal_record_replay_step() accessor on SemvecState, which skips the per-state community-tier limiter. Public update() keeps the limiter to discourage probing of the update equation.
- HMAC verify is constant-time. The comparison uses subtle::ConstantTimeEq on the Rust side; the Python facade just forwards the bytes. Malformed signatures (wrong length, non-hex chars) return False instead of raising, so a parser error never escalates to a panic.
- Body verify, then nonce. The middleware verifies the HMAC signature before it consumes the nonce. A request with a bad signature therefore does not burn the nonce, and the genuine retry can still use it.
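The last two notes can be sketched in pure Python. The real comparison lives in Rust, so this is only an analogue using hmac.compare_digest; verify_signature and authorise are hypothetical names, and the returned strings reuse the failure codes listed for the middleware.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, message: bytes, signature_hex) -> bool:
    """Constant-time check; malformed input yields False rather than raising."""
    try:
        provided = bytes.fromhex(signature_hex)
    except (ValueError, TypeError):
        return False  # non-hex characters, odd length, or not a string
    expected = hmac.new(secret, message, hashlib.sha256).digest()
    # compare_digest does not short-circuit on the first differing byte.
    return hmac.compare_digest(expected, provided)


def authorise(secret, message, signature_hex, nonce, seen_nonces: set) -> str:
    # Signature first: a forged request must not be able to burn a nonce.
    if not verify_signature(secret, message, signature_hex):
        return "bad_signature"
    if nonce in seen_nonces:
        return "nonce_replayed"
    seen_nonces.add(nonce)
    return "ok"


secret, message = b"demo-secret", b"canonical-request-bytes"
good = hmac.new(secret, message, hashlib.sha256).hexdigest()
seen = set()

forged = authorise(secret, message, "00" * 32, "n-1", seen)   # nonce untouched
genuine = authorise(secret, message, good, "n-1", seen)       # still accepted
replay = authorise(secret, message, good, "n-1", seen)        # now rejected
```

If the order were reversed, an attacker who observed a nonce could submit it with a junk signature and cause the legitimate request to be rejected as a replay.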
Demo script
scripts/demo_compliance_pack.py walks every feature end-to-end. Runs
in <2 s against a temporary SQLite store; uses the operator key at
/mnt/c/Versino PsiOmega GmbH/semvec_pypi_private_key/compliance.priv.pem
when available, otherwise mints an ephemeral key just for the demo.
Limitations
- In-memory backends only by default. SQLite event store is fine for single-process / development / small deployments. For multi-replica production, swap in a Postgres + pgvector backend (the EventStore ABC pins the contract) and replace InMemoryNonceCache with a Redis or Postgres-backed cache. Both swaps are half-day ports against the existing tests.
- HMAC secret bootstrap is on you. The Compliance Pack does not ship a "first key registration" flow. Customers exchange the HMAC secret with you out-of-band when they get their license JWT.
- Replay can be slow on huge corpora. Re-folding a million events through SemvecState.update() is O(N). The async worker keeps the request path snappy, but the rebuild itself is still N steps. Future work: a merge-friendly checkpoint format that lets replays start from a snapshot.