# Choosing an Embedder
semvec is embedder-agnostic — anything exposing `get_embedding(text) -> np.ndarray` and `get_dimension() -> int` works. This page collects the tradeoffs for the common options so you pick the right one for your workload.
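For reference, the duck-typed interface can be written down as a `typing.Protocol`. This is a documentation sketch; semvec does not ship an `Embedder` class, it only expects these two methods:

```python
from typing import Protocol

import numpy as np


class Embedder(Protocol):
    """What semvec expects from an embedder: a fixed dimension and one vector per text."""

    def get_dimension(self) -> int: ...

    def get_embedding(self, text: str) -> np.ndarray: ...
```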
## TL;DR
| Profile | Pick |
|---|---|
| Default, CPU-friendly, 384 dim | all-MiniLM-L6-v2 |
| Quality-first, 768 dim, 4× slower | all-mpnet-base-v2 |
| Multilingual, 384 dim | paraphrase-multilingual-MiniLM-L12-v2 |
| Managed API | OpenAI text-embedding-3-small (1536 dim) |
| Fastest prod, quantised | ONNX-export of all-MiniLM-L6-v2 at int8 |
The SentenceTransformer wrappers below return normalised unit vectors (`encode` is called with `normalize_embeddings=True`); the OpenAI wrapper normalises defensively before handing vectors to semvec.
## SentenceTransformers — local
### all-MiniLM-L6-v2 (default)
- 384 dim, ~23M parameters (~90 MB download)
- ~14k sentences/sec on CPU, ~40k on a GPU
- Trained on a diverse mix of 1B sentence pairs
- Matches the pss reference exactly
```python
from sentence_transformers import SentenceTransformer
import numpy as np


class STEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2", device: str = "cpu"):
        self._m = SentenceTransformer(name, device=device)
        self._dim = int(self._m.get_sentence_embedding_dimension() or 384)

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._m.encode(
            text,
            normalize_embeddings=True,
            show_progress_bar=False,
            convert_to_numpy=True,
        )
        return np.asarray(vec, dtype=np.float64)
```
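A quick sanity check of the wrapper (unit norm and the expected dimension):

```python
emb = STEmbedder()
vec = emb.get_embedding("hello world")

assert vec.shape == (emb.get_dimension(),)
assert abs(float(np.linalg.norm(vec)) - 1.0) < 1e-6  # normalised
```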
### all-mpnet-base-v2
- 768 dim, 420 MB download
- ~3k sentences/sec on CPU
- ~2-3 pp higher retrieval accuracy on long-form benchmarks
Swap the model name in the wrapper above. Pass `dimension=768` everywhere semvec takes one (`SemvecConfig.dimension`, `LongMemEvalRunner(pss_config_dimension=…)`, …).
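For example (a sketch; how the dimension is threaded through `SemvecConfig` depends on your semvec version):

```python
embedder = STEmbedder(name="all-mpnet-base-v2")
assert embedder.get_dimension() == 768

# Hypothetical wiring: pass the embedder's dimension wherever semvec asks for one.
# config = SemvecConfig(dimension=embedder.get_dimension())
```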
### Multilingual
`paraphrase-multilingual-MiniLM-L12-v2` (384 dim, ~118M parameters) covers 50+ languages. Use it when your conversation histories are not English-only.
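It drops into the same wrapper (a sketch; the sample sentence is only illustrative):

```python
emb = STEmbedder(name="paraphrase-multilingual-MiniLM-L12-v2")
vec = emb.get_embedding("Wo haben wir beim letzten Mal aufgehört?")  # "Where did we leave off last time?"
assert vec.shape == (384,)
```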
## OpenAI text-embedding-3-*
Managed endpoint, no local compute. Costs per token.
```python
import numpy as np
from openai import OpenAI


class OpenAIEmbedder:
    def __init__(self, model: str = "text-embedding-3-small"):
        self._client = OpenAI()
        self._model = model
        # 1536 for -small, 3072 for -large
        self._dim = 1536 if "small" in model else 3072

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        response = self._client.embeddings.create(model=self._model, input=text)
        vec = np.asarray(response.data[0].embedding, dtype=np.float64)
        # Normalise defensively so semvec always sees unit vectors.
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
```
Batch requests (the `input` parameter accepts lists) when possible to cut round-trip cost — semvec itself only needs one vector per call, but your application layer can amortise.
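A sketch of such a batched helper (the function name and the text-to-vector dict are illustrative, not part of semvec):

```python
def embed_batch(
    client: OpenAI, texts: list[str], model: str = "text-embedding-3-small"
) -> dict[str, np.ndarray]:
    """Embed many texts in one request; returns a text -> unit-vector map."""
    response = client.embeddings.create(model=model, input=texts)
    out: dict[str, np.ndarray] = {}
    for text, item in zip(texts, response.data):
        vec = np.asarray(item.embedding, dtype=np.float64)
        norm = np.linalg.norm(vec)
        out[text] = vec / norm if norm > 1e-8 else vec
    return out
```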
## ONNX / quantised for production latency
For serverless or edge deployments, export all-MiniLM-L6-v2 to ONNX and quantise to int8:
```bash
pip install optimum onnxruntime sentence-transformers

optimum-cli export onnx \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --task feature-extraction \
  --optimize O3 \
  onnx-minilm/
```
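The export above is still fp32. One way to produce the int8 variant is optimum's dynamic-quantisation API (a sketch; the output directory name and the AVX512-VNNI config are assumptions, so pick a config matching your CPU):

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Adjust file_name if your export wrote a different ONNX filename.
quantizer = ORTQuantizer.from_pretrained("onnx-minilm", file_name="model.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-minilm-int8", quantization_config=qconfig)
```

Point the wrapper below at whichever directory you want to serve.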
```python
import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer


class ONNXEmbedder:
    def __init__(self, path: str = "onnx-minilm"):
        self._tok = AutoTokenizer.from_pretrained(path)
        self._model = ORTModelForFeatureExtraction.from_pretrained(path)
        self._dim = 384

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        inputs = self._tok(text, return_tensors="np", truncation=True, padding=True)
        outputs = self._model(**inputs)
        # Mean-pool over tokens + L2 normalise. For a single, unpadded sequence this
        # matches SentenceTransformer's (attention-mask-weighted) mean pooling.
        vec = outputs.last_hidden_state.mean(axis=1).squeeze().astype(np.float64)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
```
int8 quantisation typically cuts model size by 4× and improves p50 latency by 2-3× on CPU with < 0.5 pp accuracy loss on standard retrieval benchmarks.
## Embedding cache
For expensive models (ONNX / OpenAI), wrap your embedder with an in-process cache to avoid re-embedding the same text across turns:
```python
from functools import lru_cache

import numpy as np


class CachedEmbedder:
    def __init__(self, inner, maxsize: int = 2048):
        self._inner = inner
        # Bind an LRU-cached version of _embed as this instance's get_embedding.
        self.get_embedding = lru_cache(maxsize=maxsize)(self._embed)

    def _embed(self, text: str) -> np.ndarray:
        return self._inner.get_embedding(text)

    def get_dimension(self) -> int:
        return self._inner.get_dimension()
```
Wrapping the embedder caches representations — useful when the same text re-appears turn-to-turn.
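Usage is just composition of the wrappers above:

```python
embedder = CachedEmbedder(OpenAIEmbedder())

v1 = embedder.get_embedding("same text")  # API call
v2 = embedder.get_embedding("same text")  # served from the LRU cache
assert v1 is v2  # the cache returns the identical array object
```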
## Determinism matters
For regression-testing parity between pss and semvec:
- Pin the model version (`all-MiniLM-L6-v2` on the same SentenceTransformer release across runs).
- Use the same `device` (CPU is more deterministic than CUDA; CUDA introduces ~1e-6 per-embedding noise).
- Disable dropout — all stock SentenceTransformer models are already in `eval()` mode, but custom fine-tunes may leak.
- Use `temperature=0.0` for any downstream LLM calls if you are comparing end-to-end accuracy, not just embeddings.
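A minimal self-check, assuming the `STEmbedder` wrapper from above (two encodes of the same text on the same device should match to within float tolerance):

```python
emb = STEmbedder(device="cpu")
a = emb.get_embedding("regression probe sentence")
b = emb.get_embedding("regression probe sentence")
assert np.allclose(a, b, atol=1e-7)
```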
See docs/benchmarks/parity.md for the documented drift envelope.