
Token Reduction (semvec.token_reduction)

Utilities for serialising state into a compact LLM context and for wiring real LLM endpoints behind SemvecChatProxy.

SemvecStateSerializer

Formats a SemvecState into a compact context string. Default budget is ~150–350 tokens; the caller controls it via SerializerConfig.max_memory_chars (per-memory cap), max_last_response_chars (last-response cap), and full_first (top-1 verbatim).

snippet — assumes `state` is a populated SemvecState, `emb` is a query embedding, `prev` is the previous response
from semvec.token_reduction import SemvecStateSerializer, SerializerConfig

# Default — every memory line capped at the Rust core's default (200 chars)
ser = SemvecStateSerializer()
context = ser.serialize(state, query_embedding=emb, last_response=prev)

# Custom — caller-controlled caps + top-1-full pattern
ser = SemvecStateSerializer(
    SerializerConfig(top_k=10, max_memory_chars=500, full_first=True)
)
context = ser.serialize(state, query_embedding=emb, last_response=prev)

SerializerConfig

| Field | Default | Purpose |
| --- | --- | --- |
| top_k | 5 (from Rust core) | Number of retrieved memories included. |
| max_memory_chars | 200 (from Rust core) | Per-memory truncation budget. Caller-controlled: pass any positive integer; very high values (e.g. 10_000) effectively disable truncation. |
| max_last_response_chars | 500 (from Rust core) | Truncation cap on the verbatim last_response block. Caller-controlled. |
| include_phase_prompt | True | Prepend the phase and a phase-specific prompt. |
| include_metrics | True | Append a one-line State: beta=…, fsm=… summary. |
| include_last_response | True | Append the previous response verbatim (capped via max_last_response_chars). |
| full_first | False | When True, the highest-ranked retrieved memory is emitted untruncated; the remaining memories are still capped at max_memory_chars. Useful when the top hit is the answer and the surrounding context only needs short labels. |

Top-1-full pattern (since 0.5.6):

cfg = SerializerConfig(
    top_k=5,
    max_memory_chars=200,   # short labels for context entries 2..5
    full_first=True,        # entry 1 stays verbatim, no truncation
)
context = SemvecStateSerializer(cfg).serialize(state, query_text="what was the Q3 miss?")

The cap is no longer baked into the wheel; pass any positive integer. The Rust core's _internal_tuning_defaults is still used as the fallback when you leave a field at None, so existing code keeps working.

serialize(state, *, query_embedding=None, query_text=None, last_response=None) -> str

  • At least one of query_embedding / query_text should be provided for relevance-sorted retrieval. Without either, the serializer falls back to recency-sorted top-k.
  • last_response is included verbatim in the context, truncated to the max_last_response_chars budget.
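
For reference, the three call shapes, reusing `state`, `emb`, and `prev` from the snippet above:

ser = SemvecStateSerializer()

# Relevance-sorted retrieval via an embedding or raw query text
context = ser.serialize(state, query_embedding=emb, last_response=prev)
context = ser.serialize(state, query_text="what was the Q3 miss?")

# No query at all: falls back to recency-sorted top-k
context = ser.serialize(state)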

SemvecChatProxy

Production-shaped chat loop: routes each turn through compressed context, stores Q&A chunks, tracks token counts.

from semvec.token_reduction import SemvecChatProxy, create_llm_client

llm = create_llm_client("openai")
proxy = SemvecChatProxy(llm_call=llm, system_prompt="You are a helpful assistant.")
result = proxy.chat("what's up with Q3?")

Constructor

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| llm_call | Callable[[list[ChatMessage]], str] \| None | built-in echo mock | Your LLM callable. |
| system_prompt | str | "You are a helpful assistant." | Injected before every turn's context. |
| pss_config | SemvecConfig \| None | SemvecConfig() | Internal state config. |
| serializer_config | SerializerConfig \| None | defaults | Context assembly config. |
| embedding_service | object \| None | auto SentenceTransformer | Any object with get_embedding(text) + get_dimension(). |

If embedding_service is omitted and SentenceTransformer is not installed, the constructor raises RuntimeError; see the module docstring for the full exception message and a copy-paste wrapper.
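
If you would rather not pull in SentenceTransformer, anything exposing get_embedding(text) and get_dimension() can be passed instead. A minimal sketch (the hash-based vectors are purely illustrative, and the plain-list return type is an assumption):

import hashlib

class TinyEmbeddingService:
    """Toy embedding service: deterministic and dependency-free, for illustration only."""

    def get_dimension(self) -> int:
        return 64

    def get_embedding(self, text: str) -> list[float]:
        # Hash the text and spread the digest bytes into a fixed-length float vector.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.get_dimension())]

proxy = SemvecChatProxy(llm_call=llm, embedding_service=TinyEmbeddingService())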

When does the proxy pay for itself?

The compressed-context prompt the proxy builds carries a small fixed overhead — the system prompt, the phase header, the literal-cache snippet, and a couple of frame markers. On a 1-turn or 2-turn chat that overhead is larger than the raw user history it replaces, so the input-token count actually goes up.

The break-even point is around 10 turns. From there on the per-call input cost stays constant (the whole point of the engine) while a naive baseline grows linearly. By turn 48 the same conversation costs ~76% less per call.

If your workload is dominated by sub-10-turn tasks (e.g. one-shot Q&A, sub-tasks dispatched by an orchestrator that never accumulates history), use the underlying state.update() + state.serialize_for_llm() directly and skip the proxy — you get the engine's bookkeeping without paying the prompt overhead on calls that don't need it.
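
A hedged sketch of that direct-use pattern; the exact update() / serialize_for_llm() signatures below are assumptions, so check the SemvecState reference for the real ones:

# sketch: assumes `state` is a populated SemvecState (as in the serializer snippets)
# and `llm` is the callable from create_llm_client above; signatures are illustrative
state.update(user_message, embedding=emb)    # assumed signature
prompt = state.serialize_for_llm()           # compressed context, no proxy wrapper
response = llm([ChatMessage(role="user", content=prompt)])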

chat(user_message) -> TurnResult

Returns a dataclass:

| Field | Type |
| --- | --- |
| response | str |
| pss_input_tokens | int \| None (from llm_call.last_usage, else None) |
| baseline_input_tokens | int \| None (always None in this shape) |
| pss_prompt | str |
| phase | str |
| turn_number | int |
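
For example, inspecting a turn:

result = proxy.chat("and how does that compare to Q2?")
print(result.phase, result.turn_number)
if result.pss_input_tokens is not None:
    # populated only when llm_call exposes last_usage
    print("input tokens:", result.pss_input_tokens)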

ChatMessage

ChatMessage(role="user", content="hello")

Dataclass with two fields: role ("system" / "user" / "assistant") and content.

LLMConfig

snippet — requires OPENAI_BASE_URL / OPENAI_MODEL / OPENAI_API_KEY in the environment
from semvec.token_reduction import LLMConfig

cfg = LLMConfig.from_env("openai")
cfg.validate()
  • LLMConfig.from_env(provider, prefix="") — reads [PREFIX_]PROVIDER_BASE_URL, [PREFIX_]PROVIDER_MODEL, [PREFIX_]PROVIDER_API_KEY.
  • prefix="JUDGE" flips every variable to JUDGE_OPENAI_* with graceful fallback to the unprefixed variant.
  • validate() raises ValueError for missing required fields.
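
For a second, independently configured endpoint such as a judge model, the prefixed form reads the JUDGE_OPENAI_* variables and falls back to the plain OPENAI_* ones:

from semvec.token_reduction import LLMConfig

judge_cfg = LLMConfig.from_env("openai", prefix="JUDGE")  # JUDGE_OPENAI_BASE_URL, JUDGE_OPENAI_MODEL, JUDGE_OPENAI_API_KEY
judge_cfg.validate()  # ValueError if a required field is missing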

OpenAIClient / OllamaClient

from semvec.token_reduction import OpenAIClient, LLMConfig, ChatMessage

client = OpenAIClient(LLMConfig(
    provider="openai",
    base_url="https://api.example.com/v1",
    model="gpt-4",
    api_key="sk-...",
    temperature=0.3,
    max_tokens=512,
))
text = client([ChatMessage(role="user", content="hi")])
usage = client.last_usage  # {"prompt_tokens": ..., "completion_tokens": ...}

Both accept a single list[ChatMessage] and return a string. They populate last_usage from the provider's usage field when present. OpenAIClient works with the OpenAI API and any compatible endpoint such as vLLM, LiteLLM, OpenRouter.
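
OllamaClient takes the same config shape. A sketch against a local Ollama install; the base_url, model name, and empty api_key below are assumptions, so adjust them to your setup:

from semvec.token_reduction import OllamaClient, LLMConfig, ChatMessage

client = OllamaClient(LLMConfig(
    provider="ollama",
    base_url="http://localhost:11434",  # assumed local default
    model="llama3.1",                   # assumed model name
    api_key="",                         # typically unused for a local server
))
text = client([ChatMessage(role="user", content="hi")])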

create_llm_client(provider="openai", prefix="") -> BaseLLMClient

Factory that calls LLMConfig.from_env + validate() and returns the right subclass.
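
The same environment-variable convention covers other providers; assuming the provider key is "ollama", this reads OLLAMA_BASE_URL, OLLAMA_MODEL, and OLLAMA_API_KEY:

llm = create_llm_client("ollama")
proxy = SemvecChatProxy(llm_call=llm, system_prompt="You are a helpful assistant.")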

TokenCounter / TurnTokens / estimate_tokens

Utility helpers for tracking and estimating token counts per turn. estimate_tokens(text: str) uses a simple chars/4 heuristic.
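
Because the estimate is chars/4, treat it as a rough guide rather than an exact count:

from semvec.token_reduction import estimate_tokens

estimate_tokens("hello world, how are you?")  # 25 chars, roughly 6 tokens under the chars/4 heuristic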

get_phase_prompt(phase) -> str / PHASE_PROMPTS

Phase-specific instruction snippets used by the serializer when include_phase_prompt=True.
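
A quick way to inspect the available prompts, assuming PHASE_PROMPTS is a plain mapping from phase name to snippet:

from semvec.token_reduction import PHASE_PROMPTS, get_phase_prompt

for phase in PHASE_PROMPTS:
    print(phase, "->", get_phase_prompt(phase)[:60])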

See also