Compliance Pack: Enterprise / Regulated Industries
Install
# Library-only use (event store, extractors, certificates,
# HMAC signing — no FastAPI):
pip install "semvec[compliance]"
# When you want the FastAPI router + middleware too:
pip install "semvec[api,compliance]"
The [compliance] extra pulls in cryptography>=42 for the
DeletionCertificate signer and the RS256 user-JWT verifier. The
FastAPI router (compliance_router) and middleware
(ComplianceHmacMiddleware) live in semvec.api.* and need the
heavier [api] extra (FastAPI, SQLAlchemy, prometheus).
The Compliance Pack adds the cryptographic verification, retention,
and selective-deletion layers that regulated tenants need on top of
the base SemvecState. Every feature ships behind a
SEMVEC_ENABLE_* environment variable, defaulting to off, so an
existing deployment that imports semvec does not pick up new
behaviour by accident.
Where each capability lives
The pack splits across three layers — the in-process library, the
FastAPI surface, and the operator/cron layer. Pick [compliance]
alone for library-only use; combine with [api] for the
HTTP-served path.
| Capability | Layer | Extra needed |
|---|---|---|
| Append-only event store | Library: semvec.compliance.event_store | [compliance] |
| Deterministic replay | Library: semvec.compliance.event_replay | [compliance] |
| ComplianceState proxy | Library: semvec.compliance.state_proxy | [compliance] |
| Numeric / date / IBAN extractors | Library: semvec.compliance.extractors | [compliance] |
| DeletionCertificate signer | Library: semvec.compliance.deletion_cert | [compliance] |
| RS256 user-JWT verifier + registry | Library: semvec.compliance.rs256 + key_registry | [compliance] |
| HMAC request signer (client side) | Library: semvec.compliance.hmac_signing | [compliance] |
| Async vector-rebuild worker | Library: semvec.compliance.workers.vector_rebuild | [compliance] |
| 30-day retention sweeper | Cron: semvec.compliance.retention | [compliance] (run via OS cron / k8s CronJob) |
| GDPR Art. 17 /forget REST route | API: semvec.api.routers.compliance_router | [api,compliance] |
| HMAC request middleware (server) | API: semvec.api.middleware.compliance_auth.ComplianceHmacMiddleware | [api,compliance] |
| RS256-secured user routes | API: semvec.api.middleware.compliance_auth | [api,compliance] |
The library layer works without FastAPI — you can call it from a
worker, a CLI, or a background task in any framework. Compliance
routes are wired unconditionally when semvec[compliance] is
installed; there is no separate enable flag in 0.6.1. The HMAC
middleware and RS256 verifier still require their own runtime envs
(keys, registry); see REST API for the full
endpoint catalogue and CLI for the runtime envs.
What's in the pack
| Capability | Module | Why |
|---|---|---|
| Append-only event store | semvec.compliance.event_store | The 3-tier memory + EMA vector + literal cache become derived views, rebuildable from the events at any time. Single source of truth for what shaped the state. |
| Deterministic replay | semvec.compliance.event_replay | Two replays of the same event stream produce bit-identical semantic_states. Required for audit re-construction and after-deletion rebuilds. |
| Automatic 30-day retention | semvec.compliance.retention | Cron-friendly sweeper that purges anything older than retention_days and writes an audit record per affected user. |
| GDPR Art. 17 forget | semvec.compliance.retention.forget_user | Synchronous wipe + signed DeletionCertificate the customer can verify offline. |
| Verbatim-precise facts | semvec.compliance.extractors | Regex-based numeric / date / identifier extractors. Decimal precision; never roundtrips through float. Includes IBAN mod-97 checksum. |
| HMAC request signing | semvec.compliance.hmac_signing + api.middleware.compliance_auth | AWS-SigV4-style canonical request (METHOD, PATH, SHA256(body), TS, NONCE), HMAC-SHA256, constant-time verify, replay defence. |
| RS256 user JWT | semvec.compliance.rs256 + key_registry | Per-user public key registered server-side, private key never leaves the client. The server cannot forge tokens. |
| Async vector rebuild | semvec.compliance.workers.vector_rebuild | Decouples the post-DELETE replay from the request path: the API endpoint enqueues, the worker rebuilds, the session store gets the new vector. |
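The extractors row above mentions an IBAN mod-97 checksum. As a rough illustration of that check (ISO 13616), here is a minimal standalone sketch. The function name `iban_is_valid` is hypothetical and is not part of semvec; the pack's own extractor may apply further rules, such as per-country length checks, that this sketch omits.

```python
def iban_is_valid(iban: str) -> bool:
    """Return True if the string passes the ISO 13616 mod-97 check."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and the two check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Letters map to numbers (A=10 ... Z=35); digits stay unchanged.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


valid = iban_is_valid("DE89 3704 0044 0532 0130 00")
tampered = iban_is_valid("DE89 3704 0044 0532 0130 01")
```

Python integers are arbitrary precision, so the whole digit string can be reduced modulo 97 in one step without chunking.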
Quickstart
Wire an event-sourced state
from semvec import SemvecConfig
from semvec.compliance.event_store import SqliteEventStore
from semvec.compliance.state_proxy import ComplianceState
store = SqliteEventStore(path="events.sqlite")
store.init_schema()
state = ComplianceState(
    SemvecConfig(dimension=384),
    event_store=store,
    user_id="user-42",
    default_meta={"channel": "chat"},
)
# Every successful update appends a MemoryEvent. Failures (dim
# mismatch, isolation reject) propagate without writing.
state.update(my_embedder.get_embedding("Hello"), "Hello")
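The append-on-success behaviour described in the comment above can be pictured with a toy proxy. This is only a sketch of the pattern: `ToyState` and `RecordingProxy` are hypothetical stand-ins, not the real SemvecState or ComplianceState, and the event tuple layout is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Hypothetical stand-in for the wrapped state object."""
    dimension: int
    vector: list = field(default_factory=list)

    def update(self, embedding, text):
        if len(embedding) != self.dimension:
            raise ValueError("dimension mismatch")
        self.vector = list(embedding)


@dataclass
class RecordingProxy:
    """Forwards update() and records an event only if it succeeded."""
    state: ToyState
    user_id: str
    events: list = field(default_factory=list)

    def update(self, embedding, text):
        # If the wrapped update raises, the append below never runs.
        self.state.update(embedding, text)
        self.events.append((self.user_id, list(embedding), text))


proxy = RecordingProxy(ToyState(dimension=2), user_id="user-42")
proxy.update([0.1, 0.2], "Hello")
try:
    proxy.update([0.1, 0.2, 0.3], "wrong dimension")
except ValueError:
    pass  # the failure propagated and nothing was written
```

Ordering the append after the wrapped call is what keeps failed updates out of the log.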
Extract verbatim facts
from semvec.compliance.extractors import extract_facts
text = "Mein Kontostand ist 1.247,38 € am 15.08.2026"
for fact in extract_facts(text):
    print(fact.kind, fact)
# numeric NumericFact(value=Decimal('1247.38'), unit='EUR', ...)
# date DateFact(value=datetime(2026, 8, 15, tzinfo=UTC), ...)
Decimal precision is enforced — Decimal('0.1') + Decimal('0.2') == Decimal('0.3') exactly. Float roundtrips are forbidden.
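To make the Decimal-versus-float point concrete, the sketch below parses a German-formatted amount straight into a Decimal using only the standard library. `parse_de_amount` is a hypothetical helper written for this illustration; the real extractors are regex-based and handle more formats.

```python
from decimal import Decimal


def parse_de_amount(raw: str) -> Decimal:
    """Parse an amount such as '1.247,38' without passing through float."""
    # In this notation dots group thousands and the comma is the decimal mark.
    return Decimal(raw.replace(".", "").replace(",", "."))


amount = parse_de_amount("1.247,38")
exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
lossy = 0.1 + 0.2 == 0.3  # binary floats cannot represent 0.1 exactly
```

Constructing the Decimal from the string keeps every digit the user wrote. Constructing it from a float would bake the binary rounding error into the value.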
Run the retention sweeper
from semvec.compliance.retention import RetentionSweeper
report = RetentionSweeper(store=store).sweep(retention_days=30)
print(report.deleted_total, report.deleted_per_user)
Idempotent — a second call with the same retention window is a no-op.
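Why a second sweep is a no-op can be seen in a small standalone sketch against an in-memory SQLite table. The `events` schema, the `sweep` function and the explicit `now` argument are assumptions made for this illustration; they are not the RetentionSweeper implementation.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def sweep(conn, retention_days, now):
    """Delete rows older than the cutoff and return deleted counts per user."""
    cutoff = int((now - timedelta(days=retention_days)).timestamp())
    with conn:  # commits on success, rolls back if an error is raised
        per_user = dict(conn.execute(
            "SELECT user_id, COUNT(*) FROM events "
            "WHERE created_at < ? GROUP BY user_id", (cutoff,)))
        conn.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
    return per_user


now = datetime(2026, 8, 15, tzinfo=timezone.utc)


def days_ago(n):
    return int((now - timedelta(days=n)).timestamp())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, created_at INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("alice", days_ago(45)), ("alice", days_ago(31)),
     ("bob", days_ago(40)), ("bob", days_ago(1))])

first = sweep(conn, 30, now)   # removes the three expired rows
second = sweep(conn, 30, now)  # nothing is left beyond the cutoff
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The delete is keyed only on the cutoff, so repeating it with the same window finds no matching rows.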
Issue a signed DeletionCertificate (GDPR Art. 17)
from semvec.compliance.audit import InMemoryAuditLog
from semvec.compliance.retention import forget_user
cert = forget_user(
    user_id="user-42",
    store=store,
    audit_log=InMemoryAuditLog(),
    issuer="versino-compliance",
)
# Customer-side verification (offline):
from semvec.compliance.certificates import verify_certificate
assert verify_certificate(cert) # uses the wheel-embedded pubkey
The certificate's reason field is server-controlled. The POST /v1/compliance/users/{uid}/forget HTTP endpoint always writes reason="user_request" into the signed payload, even if the request body carries a different value (e.g. "reason":"user_request_dsgvo_art17"). This is intentional: the signed certificate is an attestation issued by the operator, so an arbitrary user-supplied string in there would dilute its evidentiary value. Use the forget_user() Python API directly if you need a custom reason (e.g. ttl_expired from a sweeper).
The server-controlled reason field exists to keep the audit
chain deterministic: a data subject cannot manipulate the
recorded legal basis after the fact. Where you need to preserve
the caller-supplied legal basis (e.g. distinguishing
"Art. 17 erasure" from "Art. 7(3) consent withdrawal"), the
recommended transitional path is to capture the caller's stated
basis in your application's outbox (request log, ticket, DPIA
artefact) before calling /v1/compliance/users/{uid}/forget,
and reconcile it against the signed certificate when your DPO
audits the erasure trail. A future revision of the endpoint is
planned to accept a user_provided_reason body field that is
stored in the audited request log alongside the signed certificate;
until then, capture the basis client-side. Only the signed
certificate field is server-controlled.
The wheel ships with the operator's RSA-3072 public key embedded at
build time (set the SEMVEC_COMPLIANCE_PUBKEY_PEM repository
secret in CI). Customers can verify the certificate without any
configuration. Operators on a self-managed deployment override the
key via SEMVEC_COMPLIANCE_PUBKEY_FILE or SEMVEC_COMPLIANCE_PUBKEY_PEM.
Signing algorithm: RSA-PSS-SHA256 with MGF1. Although the license JWT in the licensing system is Ed25519-signed, the sign_certificate / verify_certificate pair uses RSA-PSS-SHA256 with an MGF1-SHA256 mask. RSA was chosen over Ed25519 for the certificate path because PKCS#11 HSMs and compliance-team key-management tooling have ubiquitous RSA support but spotty Ed25519 support in deployments older than 2024. An RSA signature is as long as the key modulus, so 384 bytes for an RSA-3072 key (vs Ed25519's 64). A 0.5.x roadmap item adds optional Ed25519 with auto-detection based on the registered key algorithm.
Sign HTTP requests against the server
from semvec.compliance.hmac_signing import sign_request
from datetime import UTC, datetime
import secrets
body = b'{"reason":"user_request"}'
ts = datetime.now(UTC).isoformat()
nonce = secrets.token_hex(16)
signature = sign_request(
    secret=my_hmac_secret,
    method="POST",
    path="/v1/compliance/users/user-42/forget",
    body=body,
    timestamp=ts,
    nonce=nonce,
)
headers = {
    "X-Semvec-User-Id": "user-42",
    "X-Semvec-Key-Id": my_kid,
    "X-Semvec-Timestamp": ts,
    "X-Semvec-Nonce": nonce,
    "X-Semvec-Signature": signature,
}
Sign the path, not the URL. The middleware verifies against request.url.path only; the query string is not part of the canonical request. For GET /v1/compliance/users/user-42/facts?type=numeric the signing path is /v1/compliance/users/user-42/facts. Signing a path with the query string baked in produces a 401 bad_signature.
Practical consequence: do not put tamper-relevant input in the query string (?action=delete-style toggles). Filters that only shape the response (?type=numeric) are fine: the worst a MitM can do is change the filter on a read-only request. A future release may include the canonical query string in the signed payload (AWS-SigV4 §3.2.4 style), which would be a breaking change to client signers; until then, call sites should use query parameters only to shape read-only responses.
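The path-only rule is easier to see with a standalone signer. The sketch below builds a canonical request from (METHOD, PATH, SHA256(body), TS, NONCE) and signs it with HMAC-SHA256 using the standard library. The newline-joined layout, the hex encoding and the function names are assumptions made for illustration; the byte layout that sign_request actually uses may differ, so use the library function against a real server.

```python
import hashlib
import hmac
from urllib.parse import urlsplit


def canonical_request(method, url, body, timestamp, nonce):
    """Join the signed fields; only the path component of the URL is used."""
    path = urlsplit(url).path  # drops the query string and any fragment
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, body_hash, timestamp, nonce])


def toy_sign(secret, method, url, body, timestamp, nonce):
    message = canonical_request(method, url, body, timestamp, nonce)
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


secret = b"demo-secret"  # placeholder value for the example only
ts, nonce = "2026-08-15T00:00:00+00:00", "00ff" * 8

with_query = toy_sign(
    secret, "GET", "/v1/compliance/users/user-42/facts?type=numeric",
    b"", ts, nonce)
path_only = toy_sign(
    secret, "GET", "/v1/compliance/users/user-42/facts", b"", ts, nonce)
other_path = toy_sign(
    secret, "GET", "/v1/compliance/users/user-43/facts", b"", ts, nonce)
```

Because the query string never reaches the canonical request, the first two signatures are identical, while a change to the path yields a different one.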
Mount the FastAPI middleware
from fastapi import FastAPI
from semvec.api.compliance_routes import (
    compliance_router,
    set_compliance_dependencies,
)
from semvec.api.middleware.compliance_auth import ComplianceHmacMiddleware
from semvec.compliance.audit import InMemoryAuditLog
from semvec.compliance.event_store import SqliteEventStore
from semvec.compliance.key_registry import InMemoryKeyRegistry
from semvec.compliance.nonce_cache import InMemoryNonceCache
store = SqliteEventStore(path="events.sqlite")
store.init_schema()
registry = InMemoryKeyRegistry()
nonce_cache = InMemoryNonceCache(window_seconds=60)
# KeyRegistry mutating methods are keyword-only: register/rotate/
# revoke take no positional arguments. Python enforces this at call
# time (a positional call raises TypeError); the kwarg-only signature
# also keeps the audit log readable when key rotations show up in
# trace output:
#   registry.register(user_id="alice", key_id="k1", public_key_pem=pem)
#   registry.rotate(user_id="alice", new_key_id="k2", new_public_key_pem=pem2)
#   registry.revoke(user_id="alice", key_id="k1")
app = FastAPI()
app.add_middleware(
    ComplianceHmacMiddleware,
    registry=registry,
    nonce_cache=nonce_cache,
    protected_prefix="/v1/compliance",
)
set_compliance_dependencies(store=store, audit_log=InMemoryAuditLog())
app.include_router(compliance_router)
Failure modes the middleware enforces:
- missing_signature: required X-Semvec-* headers absent.
- timestamp_out_of_window: clock skew exceeds the configured window.
- unknown_key: user/key pair not in the registry.
- user_id_mismatch: signed user-id does not match the path's user-id.
- bad_signature: HMAC verify failed.
- nonce_replayed: same nonce already observed in the window (HTTP 409).
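The nonce_replayed case depends on remembering nonces for the length of the window. The following is a minimal sketch of such a cache, assuming a single process and an injectable clock. `ToyNonceCache` and `check_and_store` are hypothetical names and do not describe the InMemoryNonceCache interface.

```python
import time


class ToyNonceCache:
    """Remembers nonces for window_seconds and reports replays."""

    def __init__(self, window_seconds, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._seen = {}  # nonce -> time it was first observed

    def check_and_store(self, nonce):
        """Return True for a fresh nonce, False for a replay."""
        now = self.clock()
        # Forget entries that have aged out of the window.
        self._seen = {n: t for n, t in self._seen.items()
                      if now - t < self.window}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        return True


fake_now = [0.0]  # mutable cell so the example controls time
cache = ToyNonceCache(window_seconds=60, clock=lambda: fake_now[0])
first_use = cache.check_and_store("abc")
replay = cache.check_and_store("abc")
fake_now[0] = 61.0
after_window = cache.check_and_store("abc")
```

An expired nonce can be accepted again here because the timestamp window check is expected to reject stale requests before the nonce is consulted.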
Runtime configuration
# Feature flags — every one defaults to off.
export SEMVEC_ENABLE_EVENT_STORE=1
export SEMVEC_ENABLE_RETENTION_SWEEPER=1
export SEMVEC_ENABLE_HMAC_SIGNING=1
export SEMVEC_ENABLE_RS256_JWT=1
export SEMVEC_ENABLE_NUMERIC_EXTRACTOR=1
# Retention windows.
export SEMVEC_RETENTION_DAYS_CHAT=30 # default 30
export SEMVEC_RETENTION_DAYS_AUDIT=2555 # default ~ 7 years
# DeletionCertificate keys.
export SEMVEC_COMPLIANCE_PRIVKEY_FILE=/path/to/compliance.priv.pem
# (Operators only; the matching public key is embedded in the wheel.)
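For code that needs to branch on these variables, a default-off reader might look like the sketch below. The helper names and the set of accepted truthy strings are assumptions; this page only shows the value 1, and semvec's own parsing may be stricter.

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}  # assumed; the docs only show "1"


def flag_enabled(name, environ=os.environ):
    """A missing or empty variable counts as off."""
    return environ.get(name, "").strip().lower() in _TRUTHY


def retention_days(name, default, environ=os.environ):
    """Read an integer day count, falling back to the documented default."""
    raw = environ.get(name)
    return int(raw) if raw else default


env = {"SEMVEC_ENABLE_EVENT_STORE": "1", "SEMVEC_RETENTION_DAYS_AUDIT": "2555"}
event_store_on = flag_enabled("SEMVEC_ENABLE_EVENT_STORE", env)
sweeper_on = flag_enabled("SEMVEC_ENABLE_RETENTION_SWEEPER", env)
chat_days = retention_days("SEMVEC_RETENTION_DAYS_CHAT", 30, env)
audit_days = retention_days("SEMVEC_RETENTION_DAYS_AUDIT", 30, env)
```

Passing the mapping in explicitly keeps the helpers testable without touching the process environment.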
Architecture notes
- Event store is authoritative; everything else is derived. A reset of the semantic state or the memory tiers does not lose information, because replay rebuilds them. A delete in the event store is the only way to genuinely forget something.
- Replay does not consume rate-limit budget. The replay path uses an internal accessor that bypasses the per-state community-tier limiter. Public update() keeps the limiter applied.
- HMAC verify is constant-time. Malformed signatures (wrong length, non-hex chars) return False instead of raising, so a parser error never escalates to a panic.
- Body verify, then nonce. The middleware verifies the HMAC signature before it consumes the nonce. A request that fails signature verification does not burn its nonce, so a genuine retry can still use the same nonce.
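The last two notes can be combined into one small sketch: a verifier that returns False on malformed input and compares digests in constant time, followed by a check order that touches the nonce set only after the signature has passed. Function names and the use of a plain set are assumptions for illustration, not the middleware's code.

```python
import hashlib
import hmac


def verify_signature(secret, message, signature_hex):
    """Return False for any malformed or non-matching signature."""
    expected = hmac.new(secret, message, hashlib.sha256).digest()
    try:
        provided = bytes.fromhex(signature_hex)
    except (ValueError, TypeError):
        return False  # non-hex characters, odd length, or not a string
    return hmac.compare_digest(expected, provided)


def authorize(secret, message, signature_hex, nonce, seen_nonces):
    if not verify_signature(secret, message, signature_hex):
        return "bad_signature"  # the nonce is not consumed on this path
    if nonce in seen_nonces:
        return "nonce_replayed"
    seen_nonces.add(nonce)
    return "ok"


secret, message = b"demo-secret", b"POST\n/v1/compliance/users/user-42/forget"
good = hmac.new(secret, message, hashlib.sha256).hexdigest()
seen = set()
forged = authorize(secret, message, "00" * 32, "n1", seen)
genuine = authorize(secret, message, good, "n1", seen)
replayed = authorize(secret, message, good, "n1", seen)
```

With the opposite order, the forged request would have burned nonce n1 and the genuine request would have been rejected as a replay.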
Limitations
- In-memory backends only by default. SQLite event store is fine for single-process / development / small deployments. For multi-replica production, swap in a Postgres + pgvector backend (the EventStore ABC pins the contract) and replace InMemoryNonceCache with a Redis or Postgres-backed cache. Both swaps are half-day ports against the existing tests.
- HMAC secret bootstrap is on you. The Compliance Pack does not ship a "first key registration" flow. Customers exchange the HMAC secret with you out-of-band when they get their license JWT.
- Replay can be slow on huge corpora. Re-folding a million events through SemvecState.update() is O(N). The async worker keeps the request path snappy, but the rebuild itself is still N steps. Future work: a merge-friendly checkpoint format that lets replays start from a snapshot.
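The checkpoint idea in the last item can be illustrated with a toy fold. The sketch assumes the state update is a plain exponential moving average over vectors, which is a stand-in and not necessarily what SemvecState.update() does, and it treats a checkpoint as an exact copy of the state after a prefix of the events.

```python
def fold(state, events, alpha=0.1):
    """Apply one EMA step per event, strictly in order."""
    for vec in events:
        state = [(1 - alpha) * s + alpha * x for s, x in zip(state, vec)]
    return state


events = [[float(i), float(i % 7)] for i in range(1000)]
zero = [0.0, 0.0]

full = fold(zero, events)               # N steps from scratch
checkpoint = fold(zero, events[:900])   # snapshot taken after 900 events
resumed = fold(checkpoint, events[900:])  # only the 100 remaining steps
```

Resuming from the snapshot performs the same floating-point operations in the same order as the full replay, so the two results compare equal. A real checkpoint format would also have to cover the memory tiers and the literal cache, which this sketch ignores.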