Wave

Wave is an experimental Kubernetes-native LLM inference gateway that sits in front of vLLM and adds production features: routing, caching, multi-tenancy, and metrics.

See LLM-Inference-Gateway-TODO.md for the full roadmap.

What Wave adds on top of vLLM

OpenAI-compatible gateway: POST /v1/chat/completions with a stable API surface for clients.
Multi-tenancy hooks: per-tenant model allow-lists and context limits, keyed by tenant_id.
Metrics: Prometheus metrics for request rate (QPS), latency, and errors, recorded at the gateway level.
Routing + affinity:
- Redis-backed session stickiness (conversation_id -> worker_id).
- KV-pressure-aware worker selection for new conversations.
- Eviction + reroute under KV pressure: unpin conversations from saturated workers and reroute them, as a cold start, to a healthier worker.
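
The routing behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Wave's actual code: a plain dict stands in for the Redis conversation_id -> worker_id map, and the names `pick_worker`, `kv_usage`, and `KV_PRESSURE_LIMIT` (with its 0.9 threshold) are hypothetical.

```python
from typing import Dict

# Hypothetical saturation threshold: fraction of a worker's KV cache in use.
KV_PRESSURE_LIMIT = 0.9


def pick_worker(conversation_id: str,
                kv_usage: Dict[str, float],
                sticky: Dict[str, str]) -> str:
    """Choose a worker for a conversation.

    kv_usage maps worker_id to its KV-cache utilization (0.0 to 1.0).
    sticky is the conversation_id -> worker_id map (a dict here; the
    README describes this map as Redis-backed).
    """
    pinned = sticky.get(conversation_id)
    # Session stickiness: keep the pinned worker while it is known and healthy.
    if pinned in kv_usage and kv_usage[pinned] < KV_PRESSURE_LIMIT:
        return pinned
    # New conversation, or the pinned worker is saturated or gone:
    # unpin and reroute to the worker with the lowest KV pressure.
    worker = min(kv_usage, key=kv_usage.get)
    sticky[conversation_id] = worker
    return worker


kv = {"w0": 0.2, "w1": 0.5}
pins: Dict[str, str] = {}
first = pick_worker("c1", kv, pins)   # new conversation: least-loaded worker
kv["w1"] = 0.1
second = pick_worker("c1", kv, pins)  # stays pinned even though w1 is now lighter
kv["w0"] = 0.95
third = pick_worker("c1", kv, pins)   # w0 is saturated: rerouted and re-pinned
```

A reroute of this kind is a cold start for the conversation, since the new worker holds none of its KV cache.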
Priority scheduling (gateway-level): for non-streaming calls, a small request queue dispatches premium requests ahead of free ones before they reach the worker.
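
One way such a queue could work is a binary heap keyed on tier, with a counter that preserves arrival order within a tier. The sketch below is an assumption about the mechanism, not the gateway's implementation; `RequestQueue` and `PRIORITY` are hypothetical names.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Lower number is dispatched first. Tier names follow the README's premium/free split.
PRIORITY = {"premium": 0, "free": 1}


class RequestQueue:
    """Strict-priority queue: premium before free, FIFO within a tier."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # Monotonic counter breaks ties so equal-priority items keep arrival order.
        self._counter = itertools.count()

    def push(self, tier: str, request: Any) -> None:
        heapq.heappush(self._heap, (PRIORITY[tier], next(self._counter), request))

    def pop(self) -> Any:
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self) -> int:
        return len(self._heap)


q = RequestQueue()
q.push("free", "a")
q.push("premium", "b")
q.push("free", "c")
q.push("premium", "d")
order = [q.pop() for _ in range(len(q))]
```

Strict priority like this can starve free requests under sustained premium load, so a real queue would likely need a size bound or an aging rule.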
Prompt caching (conversation-scoped):
- Exact cache: normalized prompt within (conversation_id, model).
- Optional semantic cache via embeddings (if sentence-transformers is installed).
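
The exact cache can be pictured as a lookup keyed on (conversation_id, model, normalized prompt). The sketch below assumes that normalization means collapsing whitespace, which may differ from what Wave actually does; `normalize`, `cache_key`, and `ExactCache` are hypothetical names, and an in-memory dict stands in for whatever store the gateway uses.

```python
import hashlib
from typing import Dict, Optional


def normalize(prompt: str) -> str:
    # Assumed normalization: trim and collapse runs of whitespace.
    return " ".join(prompt.split())


def cache_key(conversation_id: str, model: str, prompt: str) -> str:
    # Hash the normalized prompt so keys stay short for long prompts.
    digest = hashlib.sha256(normalize(prompt).encode("utf-8")).hexdigest()
    return f"{conversation_id}:{model}:{digest}"


class ExactCache:
    """Conversation-scoped exact-match cache."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, conversation_id: str, model: str, prompt: str) -> Optional[str]:
        return self._store.get(cache_key(conversation_id, model, prompt))

    def put(self, conversation_id: str, model: str, prompt: str, response: str) -> None:
        self._store[cache_key(conversation_id, model, prompt)] = response


cache = ExactCache()
cache.put("c1", "model-a", "Hello   world", "hi there")
hit = cache.get("c1", "model-a", " Hello world ")   # same prompt after normalization
miss = cache.get("c2", "model-a", "Hello world")    # different conversation
```

Scoping the key by conversation_id keeps one conversation's cached responses from being served to another.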

