Token Reduction (semvec.token_reduction)¶
Utilities for serialising state into compact LLM context and wiring real LLM endpoints behind SemvecChatProxy.
SemvecStateSerializer¶
Formats a SemvecState into a compact context string. Default budget is ~150–350 tokens; the caller controls it via SerializerConfig.max_memory_chars (per-memory cap), max_last_response_chars (last-response cap), and full_first (top-1 verbatim).
from semvec.token_reduction import SemvecStateSerializer, SerializerConfig
# Default — every memory line capped at the Rust-core default (200 chars)
ser = SemvecStateSerializer()
context = ser.serialize(state, query_embedding=emb, last_response=prev)
# Custom — caller-controlled caps + top-1-full pattern
ser = SemvecStateSerializer(
    SerializerConfig(top_k=10, max_memory_chars=500, full_first=True)
)
context = ser.serialize(state, query_embedding=emb, last_response=prev)
SerializerConfig¶
| Field | Default | Purpose |
|---|---|---|
| top_k | 5 (from Rust core) | Number of retrieved memories included. |
| max_memory_chars | 200 (from Rust core) | Per-memory truncation budget. Caller-controlled — pass any positive integer; very high values (e.g. 10_000) effectively disable truncation. |
| max_last_response_chars | 500 (from Rust core) | Truncation cap on the verbatim last_response block. Caller-controlled. |
| include_phase_prompt | True | Prepend phase + phase-specific prompt. |
| include_metrics | True | Append a one-line State: beta=…, fsm=… summary. |
| include_last_response | True | Append the previous response verbatim (capped via max_last_response_chars). |
| full_first | False | When True, the highest-ranked retrieved memory is emitted in full, untruncated; remaining memories are still capped at max_memory_chars. Useful when the top hit is the answer and surrounding context only needs short labels. |
Top-1-full pattern (since 0.5.6):
cfg = SerializerConfig(
    top_k=5,
    max_memory_chars=200,  # short labels for context entries 2..5
    full_first=True,       # entry 1 stays verbatim, no truncation
)
context = SemvecStateSerializer(cfg).serialize(state, query_text="what was the Q3 miss?")
The cap is no longer baked into the wheel — pass any positive integer. The Rust core's _internal_tuning_defaults is still used as the fallback when you leave a field at None so existing code keeps working.
serialize(state, *, query_embedding=None, query_text=None, last_response=None) -> str¶
- At least one of query_embedding / query_text should be provided for relevance-sorted retrieval. Without either, the serializer falls back to recency-sorted top-k.
- last_response is included verbatim in the context (budget-checked).
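A small sketch of the two retrieval paths, reusing the state, emb, and prev objects from the examples above:
ser = SemvecStateSerializer()
# Relevance-sorted: pass query_embedding (or query_text) so retrieved
# memories are ranked against the query.
context = ser.serialize(state, query_embedding=emb, last_response=prev)
# Recency fallback: with neither query_embedding nor query_text, the
# serializer emits the most recent top_k memories instead.
context = ser.serialize(state, last_response=prev)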
SemvecChatProxy¶
Production-shaped chat loop: routes each turn through compressed context, stores Q&A chunks, tracks token counts.
from semvec.token_reduction import SemvecChatProxy, create_llm_client
llm = create_llm_client("openai")
proxy = SemvecChatProxy(llm_call=llm, system_prompt="You are a helpful assistant.")
result = proxy.chat("what's up with Q3?")
Constructor¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm_call | Callable[[list[ChatMessage]], str] \| None | built-in echo mock | Your LLM callable. |
| system_prompt | str | "You are a helpful assistant." | Injected before every turn's context. |
| pss_config | SemvecConfig \| None | SemvecConfig() | Internal state config. |
| serializer_config | SerializerConfig \| None | defaults | Context assembly config. |
| embedding_service | object \| None | auto SentenceTransformer | Any object with get_embedding(text) + get_dimension(). |
Omitting embedding_service when SentenceTransformer is not installed raises RuntimeError — see the module docstring for the full exception message and a copy-paste wrapper.
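If you would rather plug in your own embeddings than install sentence-transformers, a minimal sketch of a conforming object follows. Only the method names get_embedding(text) and get_dimension() come from the table above; the wrapper class and the my_embed function are illustrative placeholders.
from semvec.token_reduction import SemvecChatProxy

class MyEmbeddingService:
    # Any object exposing get_embedding(text) and get_dimension() satisfies the proxy.
    def __init__(self, embed_fn, dimension: int):
        self._embed_fn = embed_fn      # e.g. a call into your own embedding model
        self._dimension = dimension

    def get_embedding(self, text: str) -> list[float]:
        return self._embed_fn(text)

    def get_dimension(self) -> int:
        return self._dimension

proxy = SemvecChatProxy(embedding_service=MyEmbeddingService(my_embed, 384))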
When does the proxy pay for itself?¶
The compressed-context prompt the proxy builds carries a small fixed overhead — the system prompt, the phase header, the literal-cache snippet, and a couple of frame markers. On a 1-turn or 2-turn chat that overhead is larger than the raw user history it replaces, so the input-token count actually goes up.
The break-even point is around 10 turns. From there on the per-call input cost stays constant (the whole point of the engine) while a naive baseline grows linearly. By turn 48 the same conversation costs ~76 % less per call.
If your workload is dominated by sub-10-turn tasks (e.g. one-shot Q&A, sub-tasks dispatched by an orchestrator that never accumulates history), use the underlying state.update() + state.serialize_for_llm() directly and skip the proxy — you get the engine's bookkeeping without paying the prompt overhead on calls that don't need it.
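A hedged sketch of that direct path; only the method names update() and serialize_for_llm() are taken from the paragraph above, the call signatures and the llm callable are assumptions:
# One-shot / sub-10-turn task: skip the proxy, use the state directly.
state.update(user_message)                  # engine bookkeeping for this turn (signature assumed)
context = state.serialize_for_llm()         # compact prompt, no proxy overhead
response = llm([
    ChatMessage(role="system", content=context),
    ChatMessage(role="user", content=user_message),
])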
chat(user_message) -> TurnResult¶
Returns a dataclass:
| Field | Type |
|---|---|
| response | str |
| pss_input_tokens | int \| None (from llm_call.last_usage, else None) |
| baseline_input_tokens | int \| None (always None in this shape) |
| pss_prompt | str |
| phase | str |
| turn_number | int |
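The per-turn numbers can be read straight off the returned dataclass, for example:
result = proxy.chat("summarise the Q3 numbers")
print(result.phase, result.turn_number)
if result.pss_input_tokens is not None:    # only set when llm_call exposes last_usage
    print("input tokens this turn:", result.pss_input_tokens)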
ChatMessage¶
Dataclass with two fields: role ("system" / "user" / "assistant") and content.
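Because llm_call is just Callable[[list[ChatMessage]], str], any wrapper that reads these two fields works; a minimal sketch, where call_my_model is a hypothetical helper and not part of semvec:
from semvec.token_reduction import ChatMessage, SemvecChatProxy

def my_llm(messages: list[ChatMessage]) -> str:
    # Forward role/content pairs to whatever client you actually use.
    payload = [{"role": m.role, "content": m.content} for m in messages]
    return call_my_model(payload)   # hypothetical helper; replace with your client

proxy = SemvecChatProxy(llm_call=my_llm)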
LLMConfig¶
from semvec.token_reduction import LLMConfig
cfg = LLMConfig.from_env("openai")
cfg.validate()
- LLMConfig.from_env(provider, prefix="") — reads [PREFIX_]PROVIDER_BASE_URL, [PREFIX_]PROVIDER_MODEL, [PREFIX_]PROVIDER_API_KEY. prefix="JUDGE" flips every variable to JUDGE_OPENAI_* with graceful fallback to the unprefixed variant.
- validate() raises ValueError for missing required fields.
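For example, to configure a separate judge model from prefixed variables (the values below are placeholders; the names follow the [PREFIX_]PROVIDER_* scheme described above):
import os
from semvec.token_reduction import LLMConfig

os.environ["JUDGE_OPENAI_BASE_URL"] = "https://api.example.com/v1"
os.environ["JUDGE_OPENAI_MODEL"] = "gpt-4"
os.environ["JUDGE_OPENAI_API_KEY"] = "sk-..."

judge_cfg = LLMConfig.from_env("openai", prefix="JUDGE")
judge_cfg.validate()   # raises ValueError if a required field is missing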
OpenAIClient / OllamaClient¶
from semvec.token_reduction import OpenAIClient, LLMConfig
client = OpenAIClient(LLMConfig(
    provider="openai",
    base_url="https://api.example.com/v1",
    model="gpt-4",
    api_key="sk-...",
    temperature=0.3,
    max_tokens=512,
))
text = client([ChatMessage(role="user", content="hi")])
usage = client.last_usage # {"prompt_tokens": ..., "completion_tokens": ...}
Both accept a single list[ChatMessage] and return a string. They populate last_usage from the provider's usage field when present. OpenAIClient works with the OpenAI API and any compatible endpoint such as vLLM, LiteLLM, OpenRouter.
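OllamaClient takes the same config shape; a sketch against a local Ollama instance (the base URL, model name, and api_key handling are assumptions, adjust to your setup):
from semvec.token_reduction import OllamaClient, LLMConfig, ChatMessage

client = OllamaClient(LLMConfig(
    provider="ollama",
    base_url="http://localhost:11434",   # default local Ollama port (assumption)
    model="llama3",
    api_key="",                          # typically unused for local Ollama (assumption)
    temperature=0.3,
    max_tokens=256,
))
text = client([ChatMessage(role="user", content="hi")])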
create_llm_client(provider="openai", prefix="") -> BaseLLMClient¶
Factory that calls LLMConfig.from_env + validate() and returns the right subclass.
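So the judge-model example above collapses to a one-liner:
from semvec.token_reduction import create_llm_client

judge = create_llm_client("openai", prefix="JUDGE")   # reads JUDGE_OPENAI_* variables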
TokenCounter / TurnTokens / estimate_tokens¶
Utility helpers for tracking and estimating token counts per turn. estimate_tokens(text: str) uses a simple chars/4 heuristic.
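A quick check of the heuristic's scale (values are approximate by construction):
from semvec.token_reduction import estimate_tokens

estimate_tokens("hello world")   # 11 chars -> roughly 2-3 tokens under chars/4
estimate_tokens("a" * 4000)      # 4000 chars -> roughly 1000 tokens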
get_phase_prompt(phase) -> str / PHASE_PROMPTS¶
Phase-specific instruction snippets used by the serializer when include_phase_prompt=True.
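A minimal sketch, assuming PHASE_PROMPTS is a mapping from phase name to snippet (the section above only guarantees the two names exist):
from semvec.token_reduction import get_phase_prompt, PHASE_PROMPTS

for phase in PHASE_PROMPTS:              # iterate the known phase names
    print(phase, "->", get_phase_prompt(phase)[:60])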
See also¶
- Quickstart — 5-minute REST + library walk-through
- Tour — token-reduction proxy — user-guide page that contextualises this API
- Architecture — abstract component model