Choosing an Embedder

semvec is embedder-agnostic — anything exposing get_embedding(text) -> np.ndarray and get_dimension() -> int works. This page collects the tradeoffs for the common options so you pick the right one for your workload.

TL;DR

Profile                               Pick
Default, CPU-friendly, 384 dim        all-MiniLM-L6-v2
Quality-first, 768 dim, 4× slower     all-mpnet-base-v2
Multilingual, 384 dim                 paraphrase-multilingual-MiniLM-L12-v2
Managed API, 1536 dim                 OpenAI text-embedding-3-small
Fastest prod, int8 quantised          ONNX export of all-MiniLM-L6-v2

The SentenceTransformer models return normalised unit vectors when encoded with normalize_embeddings=True (the wrapper below does this). OpenAI returns unnormalised vectors, so normalise them before passing anything to semvec.
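
A quick way to sanity-check normalisation (illustrative; uses the raw SentenceTransformer API directly):

from sentence_transformers import SentenceTransformer
import numpy as np

# A unit vector has L2 norm ~1.0; anything far from 1.0 needs explicit normalisation.
model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("hello world", normalize_embeddings=True, convert_to_numpy=True)
print(np.linalg.norm(vec))  # ~1.0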

SentenceTransformers — local

all-MiniLM-L6-v2 (default)

  • 384 dim, 23 MB download
  • ~14k sentences/sec on CPU, ~40k on a GPU
  • Trained on a diverse mix of 1B sentence pairs
  • Matches the pss reference exactly

from sentence_transformers import SentenceTransformer
import numpy as np

class STEmbedder:
    def __init__(self, name: str = "all-MiniLM-L6-v2", device: str = "cpu"):
        self._m = SentenceTransformer(name, device=device)
        self._dim = int(self._m.get_sentence_embedding_dimension() or 384)

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            # Empty or whitespace-only text: return a zero vector of the right shape.
            return np.zeros(self._dim, dtype=np.float64)
        vec = self._m.encode(
            text, normalize_embeddings=True,
            show_progress_bar=False, convert_to_numpy=True,
        )
        return np.asarray(vec, dtype=np.float64)
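
Illustrative usage of the wrapper; the interface is just the two methods semvec expects:

emb = STEmbedder()
vec = emb.get_embedding("When did we last discuss the onboarding checklist?")
print(emb.get_dimension(), vec.shape, vec.dtype)  # 384 (384,) float64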

all-mpnet-base-v2

  • 768 dim, 420 MB download
  • ~3k sentences/sec on CPU
  • ~2-3 pp higher retrieval accuracy on long-form benchmarks

Swap the model name in the wrapper above. Pass dimension=768 everywhere semvec takes one (SemvecConfig.dimension, LongMemEvalRunner(pss_config_dimension=…), …).
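
A minimal sanity check before wiring it in (illustrative; the exact config surface is whatever your semvec version exposes):

emb = STEmbedder(name="all-mpnet-base-v2")
assert emb.get_dimension() == 768  # the value to pass as SemvecConfig.dimension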

Multilingual

paraphrase-multilingual-MiniLM-L12-v2 (384 dim, 117 MB) covers 50+ languages. Use when your conversation histories are not English-only.

OpenAI text-embedding-3-*

Managed endpoint, no local compute. Costs per token.

import numpy as np
from openai import OpenAI

class OpenAIEmbedder:
    def __init__(self, model: str = "text-embedding-3-small"):
        self._client = OpenAI()
        self._model = model
        # 1536 for -small, 3072 for -large
        self._dim = 1536 if "small" in model else 3072

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        response = self._client.embeddings.create(model=self._model, input=text)
        vec = np.asarray(response.data[0].embedding, dtype=np.float64)
        # Normalise — OpenAI does not.
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec

Batch requests when possible (the input parameter accepts a list of strings) to cut round-trip cost; semvec itself only needs one vector per call, but your application layer can amortise across turns.
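
A sketch of batching at the application layer (the helper name is made up; the embeddings endpoint accepts a list of inputs and returns one vector per item, in order):

def embed_batch(client: OpenAI, texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    # One round trip for the whole batch; response.data preserves input order.
    response = client.embeddings.create(model=model, input=texts)
    vecs = np.asarray([d.embedding for d in response.data], dtype=np.float64)
    # Normalise each row, as above.
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-8, None)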

ONNX / quantised for production latency

For serverless or edge deployments, export all-MiniLM-L6-v2 to ONNX and quantise to int8:

pip install optimum onnxruntime sentence-transformers
optimum-cli export onnx \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --task feature-extraction \
    --optimize O3 \
    onnx-minilm/
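
The export above applies graph optimisation (O3) but does not quantise. One way to produce the int8 variant is Optimum's ORTQuantizer; this is an illustrative sketch and the configuration helpers may differ between Optimum versions:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic int8 quantisation of the exported model.
quantizer = ORTQuantizer.from_pretrained("onnx-minilm")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-minilm-int8", quantization_config=qconfig)

Point the wrapper below at whichever directory you want to serve (depending on the Optimum version you may need to name the quantised ONNX file explicitly when loading):
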
import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

class ONNXEmbedder:
    def __init__(self, path: str = "onnx-minilm"):
        self._tok = AutoTokenizer.from_pretrained(path)
        self._model = ORTModelForFeatureExtraction.from_pretrained(path)
        self._dim = 384

    def get_dimension(self) -> int:
        return self._dim

    def get_embedding(self, text: str) -> np.ndarray:
        if not text.strip():
            return np.zeros(self._dim, dtype=np.float64)
        inputs = self._tok(text, return_tensors="np", truncation=True, padding=True)
        outputs = self._model(**inputs)
        # Mean-pool over tokens, then L2-normalise. With a single input there is
        # no padding, so this matches SentenceTransformer's mean pooling.
        vec = outputs.last_hidden_state.mean(axis=1).squeeze().astype(np.float64)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 1e-8 else vec
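
An optional parity check (illustrative) against the PyTorch SentenceTransformer wrapper from earlier:

a = ONNXEmbedder("onnx-minilm").get_embedding("hello world")
b = STEmbedder().get_embedding("hello world")
print(float(a @ b))  # cosine similarity of two unit vectors; expect close to 1.0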

int8 quantisation typically cuts model size by 4× and improves p50 latency by 2-3× on CPU with < 0.5 pp accuracy loss on standard retrieval benchmarks.

Embedding cache

For expensive models (ONNX / OpenAI), wrap your embedder with an in-process cache to avoid re-embedding the same text across turns:

from functools import lru_cache
import numpy as np

class CachedEmbedder:
    def __init__(self, inner, maxsize: int = 2048):
        self._inner = inner
        self.get_embedding = lru_cache(maxsize=maxsize)(self._embed)

    def _embed(self, text: str) -> np.ndarray:
        return self._inner.get_embedding(text)

    def get_dimension(self) -> int:
        return self._inner.get_dimension()

lru_cache keys on the exact text string, so only verbatim repeats hit the cache; size maxsize to how often the same text re-appears turn-to-turn.
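
Usage is a straight wrap (illustrative), and the lru_cache wrapper exposes hit/miss statistics:

embedder = CachedEmbedder(OpenAIEmbedder())
embedder.get_embedding("same turn text")
embedder.get_embedding("same turn text")    # served from the cache
print(embedder.get_embedding.cache_info())  # hits=1, misses=1, ...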

Determinism matters

For regression-testing parity between pss and semvec:

  • Pin the model version (all-MiniLM-L6-v2 on the same SentenceTransformer release across runs).
  • Use the same device (CPU is more deterministic than CUDA; CUDA introduces ~1e-6 per-embedding noise).
  • Disable dropout. Stock SentenceTransformer models load in eval() mode, but custom fine-tunes may leave dropout active.
  • Use temperature=0.0 for any downstream LLM calls if you are comparing end-to-end accuracy, not just embeddings.
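
A minimal in-process determinism smoke test (illustrative; assumes the STEmbedder wrapper above, on CPU):

import numpy as np

e = STEmbedder(device="cpu")
a = e.get_embedding("regression fixture sentence")
b = e.get_embedding("regression fixture sentence")
assert np.array_equal(a, b)  # bit-identical within a run; pin versions for cross-run parity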

See docs/benchmarks/parity.md for the documented drift envelope.