English | 中文

ConvoStore

Lightweight, Redis-backed conversation state service purpose-built for the OpenAI Responses API, vLLM, and custom inference gateways.


Why ConvoStore

  • The Responses API introduces previous_response_id, yet most inference runtimes (vLLM, llama.cpp, TGI) are still stateless.
  • Teams repeatedly rebuild message persistence, context stitching, summarisation, and contention control.
  • Automatic Prefix Caching (APC) delivers major latency gains only when prompts are byte-identical; without strict normalisation the hit rate collapses.

ConvoStore is a thin, standalone service that absorbs those concerns so any gateway can expose stateful conversations without binding logic to the inference layer.


What ConvoStore Delivers

  1. Session Management

    • Two-level addressing (conversation_id + response_id).
    • Fast lookup by previous_response_id with idempotent writes.
    • Atomic append operations to avoid race conditions (a compare-and-set sketch follows this list).
  2. Prompt Normalisation

    • Frozen system prompts and few-shot exemplars.
    • Consistent delimiters, line endings, and tokenizer settings.
    • Returns a prefix_fingerprint so you can track APC hit rates over time (a fingerprint sketch follows this list).
  3. History Trimming & Summaries

    • Preserve the latest K turns verbatim.
    • Collapse older context into templated summaries to stay within token budgets (a trimming sketch follows this list).
    • Configurable maximum token ceilings per session.
  4. High-Performance Implementation

    • Go HTTP service with low-latency concurrency primitives.
    • Redis backing with Lua-driven append + compare-and-set to ensure atomicity.
    • Sub-millisecond hot path for resolve operations when served from cache.
  5. Observability Tooling

    • Metrics: resolve_latency_ms, append_conflicts, session_size_bytes, estimated_input_tokens, apc_fingerprint_changes.
    • Ready for Prometheus / Grafana dashboards (a metrics-registration sketch follows this list).
    • Structured logs and tracing hooks for incident triage.
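
The atomic append in items 1 and 4 comes down to a short Lua script evaluated server-side in Redis: the write succeeds only while the caller's view of the history is still current. Below is a minimal sketch in Go with go-redis; the key layout (conv:<id>:messages) and the conflict protocol are illustrative assumptions, not ConvoStore's actual implementation.

// append_sketch.go — illustrative only: key layout and conflict protocol
// are assumptions, not ConvoStore's actual wire format.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// The script appends only while the caller's expected length still holds,
// so two concurrent writers cannot interleave turns (compare-and-set).
var appendScript = redis.NewScript(`
local len = redis.call('LLEN', KEYS[1])
if len ~= tonumber(ARGV[1]) then
  return -1
end
redis.call('RPUSH', KEYS[1], ARGV[2])
return len + 1
`)

func appendTurn(ctx context.Context, rdb *redis.Client, convID string, expectedLen int, msg string) (int64, error) {
	key := "conv:" + convID + ":messages" // assumed key layout
	n, err := appendScript.Run(ctx, rdb, []string{key}, expectedLen, msg).Int64()
	if err != nil {
		return 0, err
	}
	if n < 0 { // lost the race: re-read and retry, or surface a 409
		return 0, fmt.Errorf("append conflict on conversation %s", convID)
	}
	return n, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	n, err := appendTurn(context.Background(), rdb, "conv123", 0, `{"role":"user","content":"Hello"}`)
	fmt.Println(n, err)
}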
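
The prefix_fingerprint from item 2 only needs to be a stable hash over a canonical byte representation of the frozen prefix, since APC matches on identical prefixes. A minimal sketch, assuming CRLF normalisation, trailing-whitespace stripping, and one fixed delimiter (all assumptions):

// fingerprint_sketch.go — illustrative only: the canonicalisation rules
// and delimiter below are assumptions about what "frozen" means here.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// canonicalize freezes the byte representation of the shared prefix:
// CRLF -> LF, trailing whitespace stripped, one fixed delimiter between
// parts. Byte-identical prefixes are what APC needs to score a hit.
func canonicalize(parts []string) string {
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		p = strings.ReplaceAll(p, "\r\n", "\n")
		out = append(out, strings.TrimRight(p, " \t\n"))
	}
	return strings.Join(out, "\n<|sep|>\n") // assumed delimiter
}

// prefixFingerprint hashes the canonical prefix so hit rates can be
// tracked over time: a changed fingerprint predicts APC misses.
func prefixFingerprint(system string, fewShot []string) string {
	sum := sha256.Sum256([]byte(canonicalize(append([]string{system}, fewShot...))))
	return hex.EncodeToString(sum[:])
}

func main() {
	fp := prefixFingerprint("You are a helpful assistant.", []string{"Q: hi\nA: hello"})
	fmt.Println(fp) // stable across requests => cache-friendly prefix
}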
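
History trimming (item 3) reduces to keeping the newest K turns verbatim and folding everything older into a single templated summary message. A sketch with the summariser left as a pluggable callback; the template is an assumption:

// trim_sketch.go — illustrative only: the summary template is an
// assumption, not ConvoStore's built-in format.
package main

import "fmt"

type Message struct {
	Role    string
	Content string
}

// trimHistory keeps the newest k turns verbatim and collapses everything
// older into one templated system message, bounding prompt growth.
func trimHistory(history []Message, k int, summarize func([]Message) string) []Message {
	if len(history) <= k {
		return history
	}
	older, recent := history[:len(history)-k], history[len(history)-k:]
	summary := Message{Role: "system", Content: "Summary of earlier turns: " + summarize(older)}
	return append([]Message{summary}, recent...)
}

func main() {
	h := []Message{
		{Role: "user", Content: "hi"},
		{Role: "assistant", Content: "hello"},
		{Role: "user", Content: "tell me more"},
	}
	// A real deployment would plug in a model-backed summariser here.
	out := trimHistory(h, 2, func(ms []Message) string {
		return fmt.Sprintf("%d earlier messages elided", len(ms))
	})
	fmt.Println(out)
}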
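
The metrics in item 5 map directly onto the Prometheus Go client. A registration sketch; the bucket boundaries and exposed port are assumptions:

// metrics_sketch.go — illustrative only: buckets and port are assumptions.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	resolveLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "resolve_latency_ms",
		Help:    "Latency of /resolve in milliseconds.",
		Buckets: []float64{1, 2, 5, 10, 25, 50},
	})
	appendConflicts = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "append_conflicts",
		Help: "Number of compare-and-set append failures.",
	})
)

func main() {
	prometheus.MustRegister(resolveLatency, appendConflicts)
	resolveLatency.Observe(3.2) // record a resolve that took 3.2 ms
	appendConflicts.Inc()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}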

Architecture

[Client / Gateway / LiteLLM]
        |
   (HTTP/SSE)
        v
┌──────────────────────────────────────────────┐
│              ConvoStore Service              │
│  - /resolve (previous_response_id→messages)  │
│  - /append  (append history, idempotent)     │
│  - /store_response (persist model output)    │
│  - /trim (summaries / budgeting)             │
│  - Metrics / Auth / Quota                    │
└──────────────┬───────────────────────────────┘
               │
               v
        ┌─────────────┐
        │   Redis     │
        │  (Cluster)  │
        └─────────────┘

Performance Targets

  • Resolve P50: 2–5 ms / P95 < 10 ms.
  • Append atomic write: < 3 ms.
  • APC hit rate: > 80% with stable prompt templates.
  • Scale: millions of sessions with Redis Cluster sharding.

Quick Start

# Run the service
docker run -d \
  -e REDIS_URL=redis://host:6379 \
  -p 8080:8080 \
  convostore:latest

# Append a message
curl -X POST http://localhost:8080/v1/sessions/conv123/append \
  -H "Content-Type: application/json" \
  -d '{"role":"user","content":"Hello"}'

# Resolve context from a previous response id
curl "http://localhost:8080/v1/sessions/resolve?previous_response_id=rsp_abc123"

Roadmap

  • Multi-model context support (system prompts, adapters).
  • Built-in summariser with pluggable models.
  • KServe / LiteLLM integration kits.
  • Helm chart and Kubernetes operator for production rollouts.

Who Benefits

  • Gateway authors who need pluggable session state (LiteLLM, KServe, bespoke proxies).
  • Inference platform teams chasing APC gains, lower first-token latency, and predictable quotas.
  • Applied ML platforms wanting observability around prompt reuse and session growth.

License

MIT
