Choosing an Embedder¶
semvec is embedder-agnostic — anything exposing get_embedding(text) -> np.ndarray and get_dimension() -> int works. This page collects the tradeoffs for the common options so you pick the right one for your workload.
TL;DR¶
| Profile | Pick |
|---|---|
| Default, CPU-friendly, 384 dim | all-MiniLM-L6-v2 |
| Quality-first, 768 dim, 4× slower | all-mpnet-base-v2 |
| Multilingual, 384 dim | paraphrase-multilingual-MiniLM-L12-v2 |
| Managed API | OpenAI text-embedding-3-small (1536 dim) |
| Fastest prod, quantised | ONNX export of all-MiniLM-L6-v2 at int8 |
The SentenceTransformer models listed here all produce normalised unit vectors out of the box. OpenAI returns unnormalised embeddings — normalise before passing them to semvec.
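The normalisation step is a one-liner. A minimal sketch (the `normalise` helper is illustrative, not part of semvec's API):

```python
import numpy as np


def normalise(vec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return a unit-norm copy; leave near-zero vectors untouched."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > eps else vec


v = normalise(np.array([3.0, 4.0]))
# A unit vector has L2 norm 1 up to float error.
assert abs(np.linalg.norm(v) - 1.0) < 1e-12
```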
SentenceTransformers — local¶
all-MiniLM-L6-v2 (default)¶
- 384 dim, 23 MB download
- ~14k sentences/sec on CPU, ~40k on a GPU
- Trained on a diverse mix of 1B sentence pairs
- Default in benchmark runners
```python
from sentence_transformers import SentenceTransformer
import numpy as np


class STEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2", device: str = "cpu"):
        self._m = SentenceTransformer(name, device=device)
        self._dim = int(self._m.get_sentence_embedding_dimension() or 384)

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        # Empty input short-circuits to a zero vector instead of encoding "".
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._m.encode(
            text, normalize_embeddings=True,
            show_progress_bar=False, convert_to_numpy=True,
        )
        return np.asarray(vec, dtype=np.float64)
```
all-mpnet-base-v2¶
- 768 dim, 420 MB download
- ~3k sentences/sec on CPU
- ~2-3 pp higher retrieval accuracy on long-form benchmarks
Swap the model name in the wrapper above. Pass dimension=768 everywhere semvec takes one (SemvecConfig.dimension, etc.).
Multilingual¶
paraphrase-multilingual-MiniLM-L12-v2 (384 dim, 117 MB) covers 50+ languages. Use when your conversation histories are not English-only.
OpenAI text-embedding-3-*¶
Managed endpoint, no local compute. Costs per token.
```python
import numpy as np
from openai import OpenAI


class OpenAIEmbedder:
    def __init__(self, model: str = "text-embedding-3-small"):
        self._client = OpenAI()
        self._model = model
        # 1536 for -small, 3072 for -large
        self._dim = 1536 if "small" in model else 3072

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        response = self._client.embeddings.create(model=self._model, input=text)
        vec = np.asarray(response.data[0].embedding, dtype=np.float64)
        # Normalise — OpenAI does not.
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
```
Batch requests (the input parameter accepts lists) when possible to cut round-trip cost — semvec itself only needs one vector per call, but your application layer can amortise.
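The batching itself can live in one generic helper. A minimal sketch, where `encode_batch` is a hypothetical callable standing in for one API round-trip (with OpenAI it would wrap `client.embeddings.create(model=..., input=chunk)` and pull out `data[i].embedding` in order):

```python
from typing import Callable, List

import numpy as np


def embed_in_batches(
    texts: List[str],
    encode_batch: Callable[[List[str]], List[np.ndarray]],
    batch_size: int = 64,
) -> List[np.ndarray]:
    """Split texts into chunks and embed each chunk with a single call."""
    out: List[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        out.extend(encode_batch(texts[start:start + batch_size]))
    return out
```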
ONNX / quantised for production latency¶
For serverless or edge deployments, export all-MiniLM-L6-v2 to ONNX and quantise to int8:
```bash
pip install optimum onnxruntime sentence-transformers

optimum-cli export onnx \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --task feature-extraction \
  --optimize O3 \
  onnx-minilm/
```
Install

The Python snippet below requires the optimum runtime extras: `pip install "optimum[onnxruntime]" transformers` (not bundled with semvec). The `pip install` line in the export step above only covers the export tooling.
```python
import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer


class ONNXEmbedder:
    def __init__(self, path: str = "onnx-minilm"):
        self._tok = AutoTokenizer.from_pretrained(path)
        self._model = ORTModelForFeatureExtraction.from_pretrained(path)
        self._dim = 384

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        inputs = self._tok(text, return_tensors="np", truncation=True, padding=True)
        outputs = self._model(**inputs)
        # mean-pool + L2 normalise (matches the SentenceTransformer default)
        vec = outputs.last_hidden_state.mean(axis=1).squeeze().astype(np.float64)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
```
int8 quantisation typically cuts model size by 4× and improves p50 latency by 2-3× on CPU with < 0.5 pp accuracy loss on standard retrieval benchmarks.
Sidecar embedder daemon¶
By default each API worker loads its own model copy. With `--workers N` that means N model copies, N GPU contexts, and N caches that never share. The sidecar daemon decouples this: one process holds the model; every API worker connects to it over UDS (or TCP) and submits encode requests. The connection is multiplexed — multiple in-flight requests share one socket without blocking each other.
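The sidecar's real wire protocol is msgspec-based and not specified here. As an illustration of how request ids make a shared socket safe for out-of-order replies, here is a minimal length-prefixed JSON framing sketch (all names and the frame layout are hypothetical):

```python
import json
import socket
import struct


def send_frame(sock: socket.socket, obj: dict) -> None:
    """Frame = 4-byte big-endian length prefix, then a JSON body."""
    body = json.dumps(obj).encode()
    sock.sendall(struct.pack(">I", len(body)) + body)


def recv_frame(sock: socket.socket) -> dict:
    raw = b""
    while len(raw) < 4:
        raw += sock.recv(4 - len(raw))
    (length,) = struct.unpack(">I", raw)
    body = b""
    while len(body) < length:
        body += sock.recv(length - len(body))
    return json.loads(body)


# The request id lets responses come back out of order on the shared socket.
a, b = socket.socketpair()
send_frame(a, {"id": 7, "op": "encode", "text": "hello"})
req = recv_frame(b)
send_frame(b, {"id": req["id"], "vec": [0.1, 0.2]})
resp = recv_frame(a)
assert resp["id"] == 7
```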
Topology¶
With --embedder tcp://embedder-host:9000 the daemon can also live on
a different host — useful when you want a single GPU node serving an
auto-scaled API tier on CPU instances.
Mode 1 — semvec serve spawns the daemon¶
The simplest setup. semvec serve launches the daemon, waits for the
READY handshake on an inherited fd, then starts the API workers and
points them at the UDS socket.
SIGTERM is forwarded cleanly: workers drain first, then the daemon completes its in-flight batch and exits.
Mode 2 — operator-managed daemon¶
Run the daemon stand-alone (systemd unit, k8s sidecar container, etc.) and point the API at it via URL or env:
```bash
python -m semvec.embedder \
  --listen unix:///run/semvec/embedder.sock \
  --model all-MiniLM-L6-v2 \
  --batch-max 32 --batch-wait-ms 5

semvec serve --workers 8 --embedder unix:///run/semvec/embedder.sock
```
SEMVEC_EMBEDDER_URL is read by every worker as a drop-in for
--embedder, which is useful when you don't want to thread the URL
through your process supervisor.
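The resolution order can be sketched in one function. A hedged sketch, assuming the explicit flag takes precedence over the env var (the helper name is illustrative, not semvec's code):

```python
import os
from typing import Optional


def resolve_embedder_url(cli_value: Optional[str]) -> Optional[str]:
    """--embedder wins when given; SEMVEC_EMBEDDER_URL is the drop-in fallback."""
    return cli_value or os.environ.get("SEMVEC_EMBEDDER_URL")


os.environ["SEMVEC_EMBEDDER_URL"] = "unix:///run/semvec/embedder.sock"
assert resolve_embedder_url(None) == "unix:///run/semvec/embedder.sock"
assert resolve_embedder_url("tcp://embedder-host:9000") == "tcp://embedder-host:9000"
```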
The embedder sidecar requires msgspec for its wire protocol. It is not pulled in by the [api] extra in 0.6.1, so install it explicitly (`pip install msgspec`) when you run `python -m semvec.embedder` standalone.
Sidecar environment variables¶
| Variable | Default | Purpose |
|---|---|---|
| `XDG_RUNTIME_DIR` | `/run/user/$UID` on systemd-managed Linux hosts | Socket directory for the embedder sidecar UDS when no explicit `--listen` path is set. |
| `HF_HOME` | `~/.cache/huggingface` | Root for the HuggingFace cache used by sentence-transformers (model weights, tokenizers). Set this on read-only / multi-tenant hosts so the cache lives on a writable volume. |
Python is the default, Rust is opt-in¶
The supervisor spawns the Python daemon by default in
--embedder-mode sidecar. A native Rust daemon (semvec-embedder,
ONNX-backed) is picked up automatically when either of these is set:
- `SEMVEC_EMBEDDER_BIN=/abs/path/to/semvec-embedder` — explicit override. A missing file at the path falls back to the Python daemon with a warning.
- `SEMVEC_USE_RUST_EMBEDDER=1` — opt-in flag. The supervisor then looks for `semvec-embedder` on `PATH`, and finally in `<repo>/target/release/semvec-embedder` for dev checkouts.
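The documented resolution order can be expressed as a small lookup function. A hedged sketch of that order (the function is illustrative, not the supervisor's actual code):

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_rust_daemon(repo_root: Path = Path(".")) -> Optional[str]:
    """Illustrative resolution order for the opt-in Rust daemon binary."""
    explicit = os.environ.get("SEMVEC_EMBEDDER_BIN")
    if explicit:
        if Path(explicit).is_file():
            return explicit
        return None  # missing file: caller falls back to Python with a warning
    if os.environ.get("SEMVEC_USE_RUST_EMBEDDER") == "1":
        found = shutil.which("semvec-embedder")
        if found:
            return found
        dev = repo_root / "target" / "release" / "semvec-embedder"
        if dev.is_file():
            return str(dev)
    return None  # default: Python daemon
```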
| Aspect | Python daemon | Rust daemon (opt-in) |
|---|---|---|
| Ships with the wheel | ✓ | — (build from source) |
| Cold start | ~5.8 s (model load + warmup) | ~0.3 s |
| Per-process RSS | ~1.5 GB | ~150 MB |
| Per-daemon throughput on MiniLM | higher (PyTorch cuDNN) | lower (ORT FP32) |
| Model format | sentence-transformers | ONNX (model + tokenizer in HF cache) |
| Best for | shared-host production | autoscalers, edge, spot, RAM-tight VMs |
The two daemons produce byte-identical output vectors (mean-pool + L2-normalise on the same MiniLM weights). Switching between them does not perturb retrieval or downstream LLM behaviour.
When opting in for the first time, populate the HF cache with the ONNX variant once via sentence-transformers:
Install
The ONNX backend ships as an extra:
pip install "sentence-transformers[onnx]" (not bundled with
semvec). Without the extra, backend="onnx" raises at import time.
```python
from sentence_transformers import SentenceTransformer

SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
```
The Rust daemon reads tokenizer.json and onnx/model.onnx straight
from $HF_HOME/hub/... — no separate download step.
Retrieval tuning at /v1/run¶
The context block returned by /v1/run is the prompt-side knob most
benchmarks end up sweeping. The env vars below expose every stage
without touching source — set them in the worker environment, restart
semvec serve, done. Defaults keep the pipeline identical to 0.5.6.
Core retrieval¶
| Variable | Default | What it does |
|---|---|---|
| `SEMVEC_RUN_TOP_K` | `5` | How many memories surface per request. Raise it (15–25) for long-conversation recall queries; lower it for tight prompts. |
| `SEMVEC_MMR_FETCH_K` | `0` (off) | Fetch this many candidates and MMR-rerank down to `SEMVEC_RUN_TOP_K`. Stops the final set from being filled with near-duplicates (e.g. five mentions of the same change request). Try 50; bump to 200 when the embedding model returns weak top-K. |
| `SEMVEC_MMR_LAMBDA` | `0.5` | MMR relevance/diversity mix. 1.0 = pure cosine; 0.0 = pure diversity. 0.5 is a safe default. |
| `SEMVEC_CONTEXT_BUDGET_CHARS` | `4000` | Total characters of memory text packed into the context string, sum-as-you-go. Replaces the legacy per-memory 150-char cap, which mis-allocated budget: short memories wasted it, long ones got their key facts clipped. |
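The MMR stage those two variables control is a greedy loop. A reference sketch for intuition, not semvec's actual implementation (assumes unit-norm vectors so dot products are cosines):

```python
import numpy as np


def mmr(query: np.ndarray, cands: np.ndarray, k: int, lam: float = 0.5) -> list:
    """Greedy maximal-marginal-relevance selection over unit-norm vectors."""
    rel = cands @ query  # cosine relevance to the query
    selected: list = []
    remaining = list(range(len(cands)))
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: rel[i])
        else:
            chosen = cands[selected]
            # trade relevance against similarity to anything already picked
            best = max(
                remaining,
                key=lambda i: lam * rel[i]
                - (1 - lam) * float((cands[i] @ chosen.T).max()),
            )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the loop degenerates to pure top-K by cosine, which is why near-duplicates survive it.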
BM25-hybrid (opt-in)¶
`pip install "semvec[hybrid]"` to pull in bm25s + nltk.
| Variable | Default | What it does |
|---|---|---|
| `SEMVEC_HYBRID_BM25` | `0` (off) | Master switch. When `1`, every session maintains a per-session BM25 index alongside dense vectors and /v1/run fuses both candidate lists via Reciprocal Rank Fusion before the next stage. |
| `SEMVEC_BM25_FETCH_K` | `50` | BM25 top-K fed into the fusion. Larger pools improve recall on long sessions at small index cost. |
| `SEMVEC_BM25_REBUILD_EVERY` | `64` | Ingests between snapshot rebuilds of the per-session index. Lower = fresher BM25 at higher rebuild cost. |
| `SEMVEC_RRF_K` | `60` | RRF smoothing constant. Standard value from the original RRF paper; rarely worth changing. |
| `SEMVEC_RRF_WEIGHTS` | unset (uniform) | Comma-separated per-list weights, e.g. `"1.0,0.4"` to down-weight the BM25 contribution. Useful when BM25 hurts single-fact precision on your domain — keep dense at 1.0 and dial BM25 down until precision recovers. |
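Reciprocal Rank Fusion itself is compact: each document scores the sum over lists of w_l / (k + rank_l). A reference sketch for intuition, not semvec's internal code:

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def rrf_fuse(
    ranked_lists: Sequence[List[str]],
    k: int = 60,
    weights: Optional[Sequence[float]] = None,
) -> List[str]:
    """RRF: score(d) = sum over lists of w_l / (k + rank_l(d))."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores: Dict[str, float] = defaultdict(float)
    for w, ranked in zip(weights, ranked_lists):
        for rank, doc in enumerate(ranked, start=1):  # ranks are 1-based
            scores[doc] += w / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

Setting a list's weight to 0.0 removes its influence on the ordering, which is the mechanism `SEMVEC_RRF_WEIGHTS` exposes.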
Cross-encoder rerank (opt-in)¶
| Variable | Default | What it does |
|---|---|---|
| `SEMVEC_RERANK_MODEL` | unset (off) | HuggingFace model ID, e.g. `cross-encoder/ms-marco-MiniLM-L-6-v2`. Activating it adds a rerank stage between RRF fusion and the final top-K. |
| `SEMVEC_RERANK_FETCH_K` | `50` | Candidate pool fed into the cross-encoder. The reranker selects the best `SEMVEC_RUN_TOP_K` out of this pool. |
| `SEMVEC_RERANK_BATCH` | `64` | Cross-encoder batch size. Tune against your GPU/CPU. |
| `SEMVEC_RERANK_FP16` | `0` | Set `1` for FP16 inference on GPU. Roughly 1.5–2× faster, no observable quality loss. |
| `SEMVEC_RERANK_THREADS` | `os.cpu_count()` | Torch intra-op thread cap for CPU inference. |
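The rerank stage reduces to "score all (query, candidate) pairs, keep the best top-K". A hedged sketch with the scorer abstracted out (the `rerank` helper is illustrative; with sentence-transformers the `score` callable would be a `CrossEncoder(...).predict`):

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[List[Tuple[str, str]]], List[float]],
    top_k: int = 5,
) -> List[str]:
    """Score (query, candidate) pairs and keep the top_k highest-scoring."""
    pairs = [(query, c) for c in candidates]
    scored = sorted(zip(candidates, score(pairs)), key=lambda x: x[1], reverse=True)
    return [c for c, _ in scored[:top_k]]
```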
Best LOCOMO numbers (F1 0.495, +2.6 pp over dense-only baseline) come
from the full stack at once: BM25-hybrid on, cross-encoder reranking
on, MMR off, mpnet-base-v2 as the embedder. See
benchmarks/running.md for the exact
reproduce command and .env template.
Embedding cache (built-in)¶
semvec ships with a content-addressed LRU cache + in-flight dedup wrapper that works with any embedder (in-process, sidecar, OpenAI, ONNX). Enable it via env var; the lifespan wraps whatever embedder is currently injected.
What it does:
- Cache hits skip the embedder entirely. A dict lookup + a copy replaces the GPU round-trip.
- In-flight dedup. When N callers submit the same text while an encode is already in flight, they share that one Future instead of racing to populate the same cache slot.
- LRU eviction. Recency-of-access bumps entries to the front; `popitem(last=False)` evicts the oldest entry on overflow.
Why this matters in practice: chat traffic has heavy repetition
(system prompts, common queries, retried turns, the previous LLM
response sent back in the next /v1/run). On a 100-phrase pool
benchmark the cache lifted end-to-end RPS by ~2.9× without buying
more GPU. Sizing: 10 000 entries at 384-d float64 is roughly
30 MB — cheap. Bump it higher for long-tail traffic.
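The eviction and hit-path mechanics can be sketched in a few lines. A simplified synchronous sketch only (the real `CachedEmbedder` also dedups concurrent in-flight requests via shared Futures, which this omits):

```python
from collections import OrderedDict

import numpy as np


class TinyLRUCache:
    """Simplified sketch of the cache mechanics; no concurrency handling."""

    def __init__(self, embed_fn, max_size: int = 10_000):
        self._embed = embed_fn
        self._max = max_size
        self._data: "OrderedDict[str, np.ndarray]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> np.ndarray:
        if text in self._data:
            self.hits += 1
            self._data.move_to_end(text)    # recency bump to the front
            return self._data[text].copy()  # copy: callers may mutate vectors
        self.misses += 1
        vec = self._embed(text)
        self._data[text] = vec
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the oldest entry
        return vec.copy()
```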
You can also wrap manually if you're embedding the API in your own process:
```python
from semvec.embedder.cache import CachedEmbedder

cached = CachedEmbedder(my_inner_embedder, max_size=10_000)
print(cached.stats())  # {"hits": ..., "misses": ..., "size": ..., "in_flight": ...}
```
The wrapper exposes the same get_dimension / submit /
get_embedding / shutdown interface as the underlying embedder,
so it's a drop-in replacement.
Determinism matters¶
For regression-testing or benchmark parity:
- Pin the model version (`all-MiniLM-L6-v2` on the same SentenceTransformer release across runs).
- Use the same `device` (CPU is more deterministic than CUDA; CUDA introduces ~1e-6 per-embedding noise).
- Disable dropout — all stock SentenceTransformer models are already in `eval()` mode, but custom fine-tunes may leak.
- Use `temperature=0.0` for any downstream LLM calls if you are comparing end-to-end accuracy, not just embeddings.
See benchmarks/parity.md for the documented drift envelope.