Lightweight, Redis-backed conversation state service purpose-built for the OpenAI Responses API, vLLM, and custom inference gateways.
- The Responses API introduces `previous_response_id`, yet most inference runtimes (vLLM, llama.cpp, TGI) are still stateless.
- Teams repeatedly rebuild message persistence, context stitching, summarisation, and contention control.
- Automatic Prefix Caching (APC) delivers major latency gains only when prompts are byte-identical; without strict normalisation the hit rate collapses.
ConvoStore is a thin, standalone service that absorbs those concerns so any gateway can expose stateful conversations without binding logic to the inference layer.
- Session Management
  - Two-level addressing (`conversation_id` + `response_id`).
  - Fast lookup by `previous_response_id` with idempotent writes.
  - Atomic append operations to avoid race conditions.
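The addressing and idempotency contract above can be sketched in plain Go. This is an in-memory stand-in (the real service backs this with Redis); the `Store`, `Append`, and `Resolve` names are illustrative, not ConvoStore's published API.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is one turn of a conversation.
type Message struct {
	Role    string
	Content string
}

// Store sketches two-level addressing: each write is keyed by
// conversation_id plus the response_id it belongs to, so replaying the
// same append is a no-op (idempotent), and a previous_response_id can be
// resolved back to the full message history.
type Store struct {
	mu      sync.Mutex
	history map[string][]Message // conversation_id -> ordered turns
	seen    map[string]bool      // "conv/resp" -> already appended?
	convOf  map[string]string    // response_id -> conversation_id
}

func NewStore() *Store {
	return &Store{
		history: map[string][]Message{},
		seen:    map[string]bool{},
		convOf:  map[string]string{},
	}
}

// Append adds a turn exactly once per (conversationID, responseID) pair.
// It returns false when the write was a duplicate delivery.
func (s *Store) Append(conversationID, responseID string, m Message) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := conversationID + "/" + responseID
	if s.seen[key] {
		return false // idempotent: duplicate is silently dropped
	}
	s.seen[key] = true
	s.convOf[responseID] = conversationID
	s.history[conversationID] = append(s.history[conversationID], m)
	return true
}

// Resolve maps a previous_response_id to its conversation's messages.
func (s *Store) Resolve(previousResponseID string) []Message {
	s.mu.Lock()
	defer s.mu.Unlock()
	conv, ok := s.convOf[previousResponseID]
	if !ok {
		return nil
	}
	out := make([]Message, len(s.history[conv]))
	copy(out, s.history[conv])
	return out
}

func main() {
	s := NewStore()
	s.Append("conv123", "rsp_1", Message{Role: "user", Content: "Hello"})
	s.Append("conv123", "rsp_1", Message{Role: "user", Content: "Hello"}) // duplicate, dropped
	fmt.Println(len(s.Resolve("rsp_1"))) // 1
}
```

The duplicate-drop behaviour is what makes gateway retries safe: a retried `/append` after a network timeout cannot double-write a turn.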
- Prompt Normalisation
  - Frozen system prompts and few-shot exemplars.
  - Consistent delimiters, line endings, and tokenizer settings.
  - Returns a `prefix_fingerprint` so you can track APC hit rates over time.
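Because APC only hits on byte-identical prefixes, normalisation plus a fingerprint is the core trick. A minimal sketch, assuming a rule set of CRLF-to-LF conversion, trailing-whitespace stripping, and a single trailing newline (the service's actual rules are configurable; `NormalizePrefix` and `PrefixFingerprint` are illustrative names):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// NormalizePrefix canonicalises a prompt prefix so that logically
// identical prompts become byte-identical: CRLF -> LF, trailing
// whitespace stripped per line, exactly one trailing newline.
func NormalizePrefix(prompt string) string {
	prompt = strings.ReplaceAll(prompt, "\r\n", "\n")
	lines := strings.Split(prompt, "\n")
	for i, l := range lines {
		lines[i] = strings.TrimRight(l, " \t")
	}
	return strings.TrimRight(strings.Join(lines, "\n"), "\n") + "\n"
}

// PrefixFingerprint hashes the normalised prefix. If this value changes
// between requests, the runtime's prefix cache cannot hit.
func PrefixFingerprint(prompt string) string {
	sum := sha256.Sum256([]byte(NormalizePrefix(prompt)))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := PrefixFingerprint("You are helpful.\r\nBe brief.  ")
	b := PrefixFingerprint("You are helpful.\nBe brief.")
	fmt.Println(a == b) // true: both normalise to the same bytes
}
```

Logging this fingerprint per request is what lets you correlate template drift with APC hit-rate drops.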
- History Trimming & Summaries
  - Preserve the latest K turns verbatim.
  - Collapse older context into templated summaries to stay within token budgets.
  - Configurable maximum token ceilings per session.
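The keep-last-K policy can be sketched as follows; the `summarize` callback is a hypothetical stand-in for whatever summariser you plug in, and `TrimHistory` is an illustrative name rather than the service's API.

```go
package main

import "fmt"

// Turn is one message in a conversation.
type Turn struct {
	Role    string
	Content string
}

// TrimHistory keeps the newest keepTurns turns verbatim and collapses
// everything older into a single templated system-role summary.
func TrimHistory(turns []Turn, keepTurns int, summarize func([]Turn) string) []Turn {
	if len(turns) <= keepTurns {
		return turns // nothing to collapse
	}
	old := turns[:len(turns)-keepTurns]
	recent := turns[len(turns)-keepTurns:]
	summary := Turn{
		Role:    "system",
		Content: "Summary of earlier conversation: " + summarize(old),
	}
	return append([]Turn{summary}, recent...)
}

func main() {
	turns := []Turn{
		{"user", "Hi"}, {"assistant", "Hello!"},
		{"user", "Explain APC"}, {"assistant", "It caches shared prefixes."},
	}
	trimmed := TrimHistory(turns, 2, func(old []Turn) string {
		return fmt.Sprintf("%d earlier turns omitted.", len(old))
	})
	fmt.Println(len(trimmed)) // 3: one summary turn + the last two turns
}
```

Note the trade-off: the summary turn changes the prompt prefix, so trimming should happen at stable boundaries to avoid churning the APC fingerprint on every request.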
- High-Performance Implementation
  - Go HTTP service with low-latency concurrency primitives.
  - Redis backing with Lua-driven append + compare-and-set to ensure atomicity.
  - Sub-millisecond hot path for resolve operations when served from cache.
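The contract the Redis Lua script provides can be shown with an in-memory compare-and-set sketch: append succeeds only when the caller's expected version matches the stored one, in a single critical section (Redis gives the same all-or-nothing behaviour by running the script atomically server-side). Types and method names here are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// casList mimics the versioned-append contract: every successful write
// bumps a version counter, and a write with a stale expected version is
// rejected instead of silently interleaving.
type casList struct {
	mu      sync.Mutex
	version int64
	items   []string
}

// AppendCAS returns (newVersion, newVersion) on success, or
// (-1, currentVersion) when the caller lost the race and must re-read.
func (l *casList) AppendCAS(expectedVersion int64, item string) (int64, int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if expectedVersion != l.version {
		return -1, l.version // conflict: caller's view is stale
	}
	l.items = append(l.items, item)
	l.version++
	return l.version, l.version
}

func main() {
	var l casList
	v, _ := l.AppendCAS(0, "turn-1")
	fmt.Println(v) // 1
	v, cur := l.AppendCAS(0, "turn-2") // stale expected version
	fmt.Println(v, cur)                // -1 1
}
```

Surfacing the conflict (rather than locking) is what feeds the `append_conflicts` metric and keeps the hot path lock-free from the client's perspective.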
- Observability Tooling
  - Metrics: `resolve_latency_ms`, `append_conflicts`, `session_size_bytes`, `estimated_input_tokens`, `apc_fingerprint_changes`.
  - Ready for Prometheus / Grafana dashboards.
  - Structured logs and tracing hooks for incident triage.
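As a dependency-free sketch of wiring these counters, Go's stdlib `expvar` works (the actual service targets Prometheus; the metric names below are the ones from the list above, the rest is illustrative):

```go
package main

import (
	"expvar"
	"fmt"
)

// Counters registered under the metric names from the README. expvar
// exposes them as JSON on /debug/vars when an HTTP server is running;
// a Prometheus setup would use counters and histograms instead.
var (
	appendConflicts       = expvar.NewInt("append_conflicts")
	sessionSizeBytes      = expvar.NewInt("session_size_bytes")
	apcFingerprintChanges = expvar.NewInt("apc_fingerprint_changes")
)

func main() {
	appendConflicts.Add(1)
	sessionSizeBytes.Add(512)
	apcFingerprintChanges.Add(1)
	fmt.Println(appendConflicts.Value(), sessionSizeBytes.Value()) // 1 512
}
```

Latency metrics like `resolve_latency_ms` need a histogram, which `expvar` does not provide; that is one reason the service exports Prometheus-style metrics instead.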
```
[Client / Gateway / LiteLLM]
             |
         (HTTP/SSE)
             v
┌───────────────────────────────────────────┐
│            ConvoStore Service             │
│  - /resolve (previous_id → messages)      │
│  - /append (append history, idempotent)   │
│  - /store_response (persist model output) │
│  - /trim (summaries / budgeting)          │
│  - Metrics / Auth / Quota                 │
└─────────────────────┬─────────────────────┘
                      │
                      v
               ┌─────────────┐
               │    Redis    │
               │  (Cluster)  │
               └─────────────┘
```
- Resolve P50: 2–5 ms / P95 < 10 ms.
- Append atomic write: < 3 ms.
- APC hit rate: > 80 % with stable templates.
- Scale: millions of sessions with Redis Cluster sharding.
```shell
# Run the service
docker run -d \
  -e REDIS_URL=redis://host:6379 \
  -p 8080:8080 \
  convostore:latest

# Append a message
curl -X POST http://localhost:8080/v1/sessions/conv123/append \
  -H "Content-Type: application/json" \
  -d '{"role":"user","content":"Hello"}'

# Resolve context from a previous response id
curl "http://localhost:8080/v1/sessions/resolve?previous_response_id=rsp_abc123"
```

- Multi-model context support (system prompts, adapters).
- Built-in summariser with pluggable models.
- KServe / LiteLLM integration kits.
- Helm chart and Kubernetes operator for production rollouts.
- Gateway authors who need pluggable session state (LiteLLM, KServe, bespoke proxies).
- Inference platform teams chasing APC gains, lower first-token latency, and predictable quotas.
- Applied ML platforms wanting observability around prompt reuse and session growth.
MIT