Choosing an Embedder¶
semvec is embedder-agnostic — anything exposing get_embedding(text) -> np.ndarray and get_dimension() -> int works. This page collects the tradeoffs for the common options so you pick the right one for your workload.
TL;DR¶
| Profile | Pick |
|---|---|
| Default, CPU-friendly, 384 dim | all-MiniLM-L6-v2 |
| Quality-first, 768 dim, 4× slower | all-mpnet-base-v2 |
| Multilingual, 384 dim | paraphrase-multilingual-MiniLM-L12-v2 |
| Managed API | OpenAI text-embedding-3-small (1536 dim) |
| Fastest prod, quantised | ONNX export of all-MiniLM-L6-v2 at int8 |
The SentenceTransformer models listed here all produce normalised unit vectors out of the box. OpenAI returns unnormalised embeddings — normalise before passing them to semvec.
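The normalisation step is a one-liner. A minimal sketch (the `normalise` helper is illustrative, not part of semvec's API):

```python
import numpy as np


def normalise(vec: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return a unit-norm copy; leave near-zero vectors untouched."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > eps else vec


v = normalise(np.array([3.0, 4.0]))
# A unit vector has L2 norm 1 up to float error.
assert abs(np.linalg.norm(v) - 1.0) < 1e-12
```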
SentenceTransformers — local¶
all-MiniLM-L6-v2 (default)¶
- 384 dim, 23 MB download
- ~14k sentences/sec on CPU, ~40k on a GPU
- Trained on a diverse mix of 1B sentence pairs
- Default in benchmark runners
```python
from sentence_transformers import SentenceTransformer
import numpy as np


class STEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2", device: str = "cpu"):
        self._m = SentenceTransformer(name, device=device)
        self._dim = int(self._m.get_sentence_embedding_dimension() or 384)

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        # Empty input short-circuits to a zero vector instead of encoding "".
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._m.encode(
            text, normalize_embeddings=True,
            show_progress_bar=False, convert_to_numpy=True,
        )
        return np.asarray(vec, dtype=np.float64)
```
all-mpnet-base-v2¶
- 768 dim, 420 MB download
- ~3k sentences/sec on CPU
- ~2-3 pp higher retrieval accuracy on long-form benchmarks
Swap the model name in the wrapper above. Pass dimension=768 everywhere semvec takes one (SemvecConfig.dimension, etc.).
Multilingual¶
paraphrase-multilingual-MiniLM-L12-v2 (384 dim, 117 MB) covers 50+ languages. Use when your conversation histories are not English-only.
OpenAI text-embedding-3-*¶
Managed endpoint, no local compute. Costs per token.
```python
import numpy as np
from openai import OpenAI


class OpenAIEmbedder:
    def __init__(self, model: str = "text-embedding-3-small"):
        self._client = OpenAI()
        self._model = model
        # 1536 for -small, 3072 for -large
        self._dim = 1536 if "small" in model else 3072

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        response = self._client.embeddings.create(model=self._model, input=text)
        vec = np.asarray(response.data[0].embedding, dtype=np.float64)
        # Normalise — OpenAI does not.
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
```
Batch requests (the input parameter accepts lists) when possible to cut round-trip cost — semvec itself only needs one vector per call, but your application layer can amortise.
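The batching itself can live in one generic helper. A minimal sketch, where `encode_batch` is a hypothetical callable standing in for one API round-trip (with OpenAI it would wrap `client.embeddings.create(model=..., input=chunk)` and pull out `data[i].embedding` in order):

```python
from typing import Callable, List

import numpy as np


def embed_in_batches(
    texts: List[str],
    encode_batch: Callable[[List[str]], List[np.ndarray]],
    batch_size: int = 64,
) -> List[np.ndarray]:
    """Split texts into chunks and embed each chunk with a single call."""
    out: List[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        out.extend(encode_batch(texts[start:start + batch_size]))
    return out
```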
ONNX / quantised for production latency¶
For serverless or edge deployments, export all-MiniLM-L6-v2 to ONNX and quantise to int8:
```bash
pip install optimum onnxruntime sentence-transformers

optimum-cli export onnx \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --task feature-extraction \
  --optimize O3 \
  onnx-minilm/
```
Install

The Python snippet below requires the optimum runtime extras: `pip install "optimum[onnxruntime]" transformers` (not bundled with semvec). The `pip install` line in the export step above only covers the export tooling.
```python
import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer


class ONNXEmbedder:
    def __init__(self, path: str = "onnx-minilm"):
        self._tok = AutoTokenizer.from_pretrained(path)
        self._model = ORTModelForFeatureExtraction.from_pretrained(path)
        self._dim = 384

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        inputs = self._tok(text, return_tensors="np", truncation=True, padding=True)
        outputs = self._model(**inputs)
        # mean-pool + L2 normalise (matches the SentenceTransformer default)
        vec = outputs.last_hidden_state.mean(axis=1).squeeze().astype(np.float64)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
```
int8 quantisation typically cuts model size by 4× and improves p50 latency by 2-3× on CPU with < 0.5 pp accuracy loss on standard retrieval benchmarks.
Sidecar embedder daemon¶
By default each API worker loads its own model copy. With `--workers N` that means N model copies, N GPU contexts, and N caches that never share. The sidecar daemon decouples this: one process holds the model; every API worker connects to it over UDS (or TCP) and submits encode requests. The connection is multiplexed — multiple in-flight requests share one socket without blocking each other.
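The sidecar's real wire protocol is msgspec-based and not specified here. As an illustration of how request ids make a shared socket safe for out-of-order replies, here is a minimal length-prefixed JSON framing sketch (all names and the frame layout are hypothetical):

```python
import json
import socket
import struct


def send_frame(sock: socket.socket, obj: dict) -> None:
    """Frame = 4-byte big-endian length prefix, then a JSON body."""
    body = json.dumps(obj).encode()
    sock.sendall(struct.pack(">I", len(body)) + body)


def recv_frame(sock: socket.socket) -> dict:
    raw = b""
    while len(raw) < 4:
        raw += sock.recv(4 - len(raw))
    (length,) = struct.unpack(">I", raw)
    body = b""
    while len(body) < length:
        body += sock.recv(length - len(body))
    return json.loads(body)


# The request id lets responses come back out of order on the shared socket.
a, b = socket.socketpair()
send_frame(a, {"id": 7, "op": "encode", "text": "hello"})
req = recv_frame(b)
send_frame(b, {"id": req["id"], "vec": [0.1, 0.2]})
resp = recv_frame(a)
assert resp["id"] == 7
```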
Topology¶
With --embedder tcp://embedder-host:9000 the daemon can also live on
a different host — useful when you want a single GPU node serving an
auto-scaled API tier on CPU instances.
Mode 1 — semvec serve spawns the daemon¶
The simplest setup. semvec serve launches the daemon, waits for the
READY handshake on an inherited fd, then starts the API workers and
points them at the UDS socket.
SIGTERM is forwarded cleanly: workers drain first, then the daemon completes its in-flight batch and exits.
Mode 2 — operator-managed daemon¶
Run the daemon stand-alone (systemd unit, k8s sidecar container, etc.) and point the API at it via URL or env:
```bash
python -m semvec.embedder \
  --listen unix:///run/semvec/embedder.sock \
  --model all-MiniLM-L6-v2 \
  --batch-max 32 --batch-wait-ms 5

semvec serve --workers 8 --embedder unix:///run/semvec/embedder.sock
```
SEMVEC_EMBEDDER_URL is read by every worker as a drop-in for
--embedder, which is useful when you don't want to thread the URL
through your process supervisor.
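The resolution order can be sketched in one function. A hedged sketch, assuming the explicit flag takes precedence over the env var (the helper name is illustrative, not semvec's code):

```python
import os
from typing import Optional


def resolve_embedder_url(cli_value: Optional[str]) -> Optional[str]:
    """--embedder wins when given; SEMVEC_EMBEDDER_URL is the drop-in fallback."""
    return cli_value or os.environ.get("SEMVEC_EMBEDDER_URL")


os.environ["SEMVEC_EMBEDDER_URL"] = "unix:///run/semvec/embedder.sock"
assert resolve_embedder_url(None) == "unix:///run/semvec/embedder.sock"
assert resolve_embedder_url("tcp://embedder-host:9000") == "tcp://embedder-host:9000"
```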
The embedder sidecar requires msgspec for its wire protocol. It is not pulled in by the [api] extra in 0.6.1, so install it explicitly (`pip install msgspec`) when you run `python -m semvec.embedder` standalone.
Sidecar environment variables¶
| Variable | Default | Purpose |
|---|---|---|
| `XDG_RUNTIME_DIR` | `/run/user/$UID` on systemd-managed Linux hosts | Socket directory for the embedder sidecar UDS when no explicit `--listen` path is set. |
| `HF_HOME` | `~/.cache/huggingface` | Root for the HuggingFace cache used by sentence-transformers (model weights, tokenizers). Set this on read-only / multi-tenant hosts so the cache lives on a writable volume. |
Python is the default, Rust is opt-in¶
The supervisor spawns the Python daemon by default in
--embedder-mode sidecar. A native Rust daemon (semvec-embedder,
ONNX-backed) is picked up automatically when either of these is set:
- `SEMVEC_EMBEDDER_BIN=/abs/path/to/semvec-embedder` — explicit override. A missing file at the path falls back to the Python daemon with a warning.
- `SEMVEC_USE_RUST_EMBEDDER=1` — opt-in flag. The supervisor then looks for `semvec-embedder` on `PATH`, and finally in `<repo>/target/release/semvec-embedder` for dev checkouts.
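The documented resolution order can be expressed as a small lookup function. A hedged sketch of that order (the function is illustrative, not the supervisor's actual code):

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def find_rust_daemon(repo_root: Path = Path(".")) -> Optional[str]:
    """Illustrative resolution order for the opt-in Rust daemon binary."""
    explicit = os.environ.get("SEMVEC_EMBEDDER_BIN")
    if explicit:
        if Path(explicit).is_file():
            return explicit
        return None  # missing file: caller falls back to Python with a warning
    if os.environ.get("SEMVEC_USE_RUST_EMBEDDER") == "1":
        found = shutil.which("semvec-embedder")
        if found:
            return found
        dev = repo_root / "target" / "release" / "semvec-embedder"
        if dev.is_file():
            return str(dev)
    return None  # default: Python daemon
```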
| Aspect | Python daemon | Rust daemon (opt-in) |
|---|---|---|
| Ships with the wheel | ✓ | — (build from source) |
| Cold start | ~5.8 s (model load + warmup) | ~0.3 s |
| Per-process RSS | ~1.5 GB | ~150 MB |
| Per-daemon throughput on MiniLM | higher (PyTorch cuDNN) | lower (ORT FP32) |
| Model format | sentence-transformers | ONNX (model + tokenizer in HF cache) |
| Best for | shared-host production | autoscalers, edge, spot, RAM-tight VMs |
The two daemons produce byte-identical output vectors (mean-pool + L2-normalise on the same MiniLM weights). Switching between them does not perturb retrieval or downstream LLM behaviour.
When opting in for the first time, populate the HF cache with the ONNX variant once via sentence-transformers:
Install
The ONNX backend ships as an extra:
pip install "sentence-transformers[onnx]" (not bundled with
semvec). Without the extra, backend="onnx" raises at import time.
```python
from sentence_transformers import SentenceTransformer

SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
```
The Rust daemon reads tokenizer.json and onnx/model.onnx straight
from $HF_HOME/hub/... — no separate download step.
Retrieval tuning at /v1/run¶
The context block returned by /v1/run is the prompt-side knob most
benchmarks end up sweeping. The env vars below expose every stage
without touching source — set them in the worker environment, restart
semvec serve, done. Defaults keep the pipeline identical to 0.5.6.
Core retrieval¶
| Variable | Default | What it does |
|---|---|---|
| `SEMVEC_RUN_TOP_K` | `5` | How many memories surface per request. Raise it (15–25) for long-conversation recall queries; lower it for tight prompts. |
| `SEMVEC_MMR_FETCH_K` | `0` (off) | Fetch this many candidates and MMR-rerank down to `SEMVEC_RUN_TOP_K`. Stops the final set from being filled with near-duplicates (e.g. five mentions of the same change request). Try 50; bump to 200 when the embedding model returns weak top-K. |
| `SEMVEC_MMR_LAMBDA` | `0.5` | MMR relevance/diversity mix. 1.0 = pure cosine; 0.0 = pure diversity. 0.5 is a safe default. |
| `SEMVEC_CONTEXT_BUDGET_CHARS` | `4000` | Total characters of memory text packed into the context string, sum-as-you-go. Replaces the legacy per-memory 150-char cap, which mis-allocated budget: short memories wasted it, long ones got their key facts clipped. |
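The MMR stage those two variables control is a greedy loop. A reference sketch for intuition, not semvec's actual implementation (assumes unit-norm vectors so dot products are cosines):

```python
import numpy as np


def mmr(query: np.ndarray, cands: np.ndarray, k: int, lam: float = 0.5) -> list:
    """Greedy maximal-marginal-relevance selection over unit-norm vectors."""
    rel = cands @ query  # cosine relevance to the query
    selected: list = []
    remaining = list(range(len(cands)))
    while remaining and len(selected) < k:
        if not selected:
            best = max(remaining, key=lambda i: rel[i])
        else:
            chosen = cands[selected]
            # trade relevance against similarity to anything already picked
            best = max(
                remaining,
                key=lambda i: lam * rel[i]
                - (1 - lam) * float((cands[i] @ chosen.T).max()),
            )
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the loop degenerates to pure top-K by cosine, which is why near-duplicates survive it.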
BM25-hybrid (opt-in)¶
`pip install "semvec[hybrid]"` to pull in bm25s + nltk.
| Variable | Default | What it does |
|---|---|---|
| `SEMVEC_HYBRID_BM25` | `0` (off) | Master switch. When `1`, every session maintains a per-session BM25 index alongside dense vectors and /v1/run fuses both candidate lists via Reciprocal Rank Fusion before the next stage. |
| `SEMVEC_BM25_FETCH_K` | `50` | BM25 top-K fed into the fusion. Larger pools improve recall on long sessions at small index cost. |
| `SEMVEC_BM25_REBUILD_EVERY` | `64` | Ingests between snapshot rebuilds of the per-session index. Lower = fresher BM25 at higher rebuild cost. |
| `SEMVEC_RRF_K` | `60` | RRF smoothing constant. Standard value from the original RRF paper; rarely worth changing. |
| `SEMVEC_RRF_WEIGHTS` | unset (uniform) | Comma-separated per-list weights, e.g. `"1.0,0.4"` to down-weight the BM25 contribution. Useful when BM25 hurts single-fact precision on your domain — keep dense at 1.0 and dial BM25 down until precision recovers. |
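Reciprocal Rank Fusion itself is compact: each document scores the sum over lists of w_l / (k + rank_l). A reference sketch for intuition, not semvec's internal code:

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def rrf_fuse(
    ranked_lists: Sequence[List[str]],
    k: int = 60,
    weights: Optional[Sequence[float]] = None,
) -> List[str]:
    """RRF: score(d) = sum over lists of w_l / (k + rank_l(d))."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores: Dict[str, float] = defaultdict(float)
    for w, ranked in zip(weights, ranked_lists):
        for rank, doc in enumerate(ranked, start=1):  # ranks are 1-based
            scores[doc] += w / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```

Setting a list's weight to 0.0 removes its influence on the ordering, which is the mechanism `SEMVEC_RRF_WEIGHTS` exposes.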
Cross-encoder rerank (opt-in)¶
| Variable | Default | What it does |
|---|---|---|
| `SEMVEC_RERANK_MODEL` | unset (off) | HuggingFace model ID, e.g. `cross-encoder/ms-marco-MiniLM-L-6-v2`. Activating it adds a rerank stage between RRF fusion and the final top-K. |
| `SEMVEC_RERANK_FETCH_K` | `50` | Candidate pool fed into the cross-encoder. The reranker selects the best `SEMVEC_RUN_TOP_K` out of this pool. |
| `SEMVEC_RERANK_BATCH` | `64` | Cross-encoder batch size. Tune against your GPU/CPU. |
| `SEMVEC_RERANK_FP16` | `0` | Set `1` for FP16 inference on GPU. Roughly 1.5–2× faster, no observable quality loss. |
| `SEMVEC_RERANK_THREADS` | `os.cpu_count()` | Torch intra-op thread cap for CPU inference. |
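The rerank stage reduces to "score all (query, candidate) pairs, keep the best top-K". A hedged sketch with the scorer abstracted out (the `rerank` helper is illustrative; with sentence-transformers the `score` callable would be a `CrossEncoder(...).predict`):

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score: Callable[[List[Tuple[str, str]]], List[float]],
    top_k: int = 5,
) -> List[str]:
    """Score (query, candidate) pairs and keep the top_k highest-scoring."""
    pairs = [(query, c) for c in candidates]
    scored = sorted(zip(candidates, score(pairs)), key=lambda x: x[1], reverse=True)
    return [c for c, _ in scored[:top_k]]
```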
Best LOCOMO numbers (F1 0.495, +2.6 pp over dense-only baseline) come
from the full stack at once: BM25-hybrid on, cross-encoder reranking
on, MMR off, mpnet-base-v2 as the embedder. See
benchmarks/running.md for the exact
reproduce command and .env template.
Embedding cache (built-in)¶
semvec ships with a content-addressed LRU cache + in-flight dedup wrapper that works with any embedder (in-process, sidecar, OpenAI, ONNX). Enable it via env var; the lifespan wraps whatever embedder is currently injected.
What it does:
- Cache hits skip the embedder entirely. A dict lookup + a copy replaces the GPU round-trip.
- In-flight dedup. When N callers submit the same text while an encode is already in flight, they share that one Future instead of racing to populate the same cache slot.
- LRU eviction. Recency-of-access bumps entries to the front; `popitem(last=False)` evicts the oldest entry on overflow.
Why this matters in practice: chat traffic has heavy repetition
(system prompts, common queries, retried turns, the previous LLM
response sent back in the next /v1/run). On a 100-phrase pool
benchmark the cache lifted end-to-end RPS by ~2.9× without buying
more GPU. Sizing: 10 000 entries at 384-d float64 is roughly
30 MB — cheap. Bump it higher for long-tail traffic.
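The eviction and hit-path mechanics can be sketched in a few lines. A simplified synchronous sketch only (the real `CachedEmbedder` also dedups concurrent in-flight requests via shared Futures, which this omits):

```python
from collections import OrderedDict

import numpy as np


class TinyLRUCache:
    """Simplified sketch of the cache mechanics; no concurrency handling."""

    def __init__(self, embed_fn, max_size: int = 10_000):
        self._embed = embed_fn
        self._max = max_size
        self._data: "OrderedDict[str, np.ndarray]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> np.ndarray:
        if text in self._data:
            self.hits += 1
            self._data.move_to_end(text)    # recency bump to the front
            return self._data[text].copy()  # copy: callers may mutate vectors
        self.misses += 1
        vec = self._embed(text)
        self._data[text] = vec
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the oldest entry
        return vec.copy()
```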
You can also wrap manually if you're embedding the API in your own process:
```python
from semvec.embedder.cache import CachedEmbedder

cached = CachedEmbedder(my_inner_embedder, max_size=10_000)
print(cached.stats())  # {"hits": ..., "misses": ..., "size": ..., "in_flight": ...}
```
The wrapper exposes the same get_dimension / submit /
get_embedding / shutdown interface as the underlying embedder,
so it's a drop-in replacement.
Determinism matters¶
For regression-testing or benchmark parity:
- Pin the model version (`all-MiniLM-L6-v2` on the same SentenceTransformer release across runs).
- Use the same `device` (CPU is more deterministic than CUDA; CUDA introduces ~1e-6 per-embedding noise).
- Disable dropout — all stock SentenceTransformer models are already in `eval()` mode, but custom fine-tunes may leak.
- Use `temperature=0.0` for any downstream LLM calls if you are comparing end-to-end accuracy, not just embeddings.
See benchmarks/parity.md for the documented drift envelope.