Unified LLM proxy that aggregates multiple servers behind a single OpenAI-compatible API

Smart routing · Model fallbacks · LLM Council · Response storage · Prometheus metrics
| Feature | Description |
|---|---|
| Multi-backend proxy | Aggregate Ollama, OpenAI API, Anthropic, Gemini, or any of the 100+ providers supported via LiteLLM |
| Smart routing | Priority-based, loaded-model preference, or round-robin, with automatic fallback chains |
| Dual API | OpenAI-compatible (/v1/) and Ollama-native (/api/) endpoints side by side |
| LLM Council | Multi-model deliberation: multiple personas analyze your question, then a synthesizer produces one answer |
| Response storage | Every token saved to compressed SQLite for offline analysis, including thinking tokens |
| Prometheus metrics | Request rates, latency histograms, token counters, backend health, tokens/sec |
| Docker ready | Full Compose stack with Prometheus + Grafana, pre-built dashboard included |
| Open WebUI compatible | Drop-in replacement: point Open WebUI at port 11430 and go |
> **Tip**
> The smart router prefers backends with the model already loaded in VRAM, so repeated queries to the same model skip cold-load latency entirely.
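The preference described above can be illustrated with a simplified sketch. Note that `Backend` and `choose_backend` here are illustrative stand-ins, not the project's internal API:

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    priority: int  # lower value = higher priority
    loaded_models: set[str] = field(default_factory=set)


def choose_backend(model: str, backends: list[Backend]) -> Backend:
    """Prefer backends that already have the model in VRAM; break ties by priority."""
    return min(
        backends,
        key=lambda b: (model not in b.loaded_models, b.priority),
    )


local = Backend("local-ollama", priority=1)
gpu = Backend("gpu-server", priority=2, loaded_models={"llama3.1:8b"})

# gpu-server wins despite lower priority: the model is already loaded
print(choose_backend("llama3.1:8b", [local, gpu]).name)
```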
- Python 3.12+
- At least one LLM backend (Ollama, OpenAI API, Anthropic, Gemini, or any LiteLLM-supported provider)
```bash
# Install
pip install -e .

# Or with dev dependencies
pip install -e ".[dev]"

# Edit config.yaml with your backends
vim config.yaml
```

```yaml
backends:
  - name: "local-ollama"
    type: ollama
    url: "http://localhost:11434"
    priority: 1

  - name: "gpu-server"
    type: ollama
    url: "http://192.168.1.100:11434"
    priority: 2
```

```bash
# Default: http://0.0.0.0:11430
python -m council_router
```

```bash
# List all models from all backends
curl http://localhost:11430/v1/models
```
```bash
# Chat completion
curl -X POST http://localhost:11430/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Streaming
curl -N -X POST http://localhost:11430/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "stream": true
  }'
```

```python
from openai import OpenAI

# Point at council-router instead of Ollama directly
client = OpenAI(base_url="http://localhost:11430/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```bash
# Works with any Ollama client – just change the port
curl http://localhost:11430/api/tags

curl -X POST http://localhost:11430/api/chat \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hi"}]}'
```

| Method | Path | Description |
|---|---|---|
| GET | `/` | Server info |
| GET | `/health` | Health check with per-backend status |
| GET | `/metrics` | Prometheus metrics exposition |
| GET | `/v1/models` | List all models across all backends |
| GET | `/v1/models/{id}` | Get model details + which backends serve it |
| POST | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| POST | `/v1/council/chat` | Council deliberation – synthesized multi-perspective response |
| GET | `/api/tags` | Ollama-native model listing |
| POST | `/api/chat` | Ollama-native chat (NDJSON streaming) |
| POST | `/api/generate` | Ollama-native generate |
| GET | `/admin/backends` | Backend management (auth required) |
| GET | `/admin/responses` | Query stored responses (paginated, filterable) |
| GET | `/admin/responses/stats` | Storage statistics |
| GET | `/admin/responses/{id}` | Full response with decompressed content |
| GET | `/admin/council/templates` | List available council templates |
If the requested model isn't available, the proxy checks the fallback rules and transparently substitutes an alternative. The `X-Fallback-Model` response header tells you which model actually served the request.
The council feature enables multi-perspective deliberation for higher quality responses. Multiple personas analyze your question in parallel, then a synthesizer produces one comprehensive answer.
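The flow reduces to "fan out to personas, then synthesize". A simplified sketch, with persona names and the `deliberate`/`ask` helpers as illustrative assumptions rather than the project's engine code:

```python
def deliberate(question: str, personas: list[str], ask) -> str:
    """Collect each persona's answer, then have a synthesizer merge them.

    `ask(persona, prompt)` stands in for a chat-completion call to a backend.
    """
    opinions = [f"{p}: {ask(p, question)}" for p in personas]
    return ask("synthesizer", question + "\n\nPerspectives:\n" + "\n".join(opinions))


# Stub backend so the sketch runs without a server
answer = deliberate(
    "Should we adopt microservices?",
    ["critic", "explorer", "engineer"],
    ask=lambda persona, prompt: f"<{persona} view>",
)
print(answer)  # <synthesizer view>
```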
```bash
curl -X POST http://localhost:11430/v1/council/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Should we adopt microservices?"}]
  }'
```

| Template | Participants | Rounds | Description |
|---|---|---|---|
| `diverse-analysis` | 3 | 1 | Critical thinker + creative explorer + practical engineer |
| `code-review` | 3 | 1 | Security + performance + architecture reviewers |
| `debate` | 2 | 2 | Proponent vs. opponent, two rounds, then neutral judge |
Custom templates can be added as YAML files in the `council_templates/` directory.
> **Note**
> Single-model councils (all participants use the same model but different personas) trade response diversity for simplicity. For more varied insights, configure per-participant models in your template.
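A per-participant-model template might look roughly like this. The field names below are illustrative assumptions, not the documented schema (check the built-in templates in `council_templates/` for the exact format):

```yaml
# council_templates/pros-cons.yaml (illustrative field names)
name: pros-cons
rounds: 1
participants:
  - persona: "optimist"
    model: "llama3.1:8b"
  - persona: "skeptic"
    model: "mistral:7b"
synthesizer:
  model: "llama3.1:8b"
```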
All settings live in `config.yaml`. Environment variables can be injected with `${VAR_NAME}` syntax.
| Setting | Default | Description |
|---|---|---|
| `server.port` | `11430` | Server port |
| `routing.strategy` | `smart` | Routing strategy: `smart`, `priority`, `round-robin` |
| `routing.prefer_loaded_model` | `true` | Prefer backends with model in VRAM |
| `routing.heartbeat_interval` | `10` | Seconds between health checks |
| `response_storage.enabled` | `true` | Save all responses to SQLite |
| `response_storage.compress` | `true` | zlib-compress stored content |
| `response_storage.db_path` | `./data/responses.db` | Database location |
| `metrics.enabled` | `true` | Expose Prometheus metrics at `/metrics` |
| `admin.auth.type` | `bearer` | Admin auth: `bearer` or `none` |
| `admin.auth.token` | `${ADMIN_TOKEN}` | Bearer token for admin endpoints |
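For example, the admin token can be pulled from the environment rather than hard-coded in `config.yaml`:

```yaml
admin:
  auth:
    type: bearer
    token: "${ADMIN_TOKEN}"   # injected from the environment when the config is loaded
```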
Authentication
- Inference endpoints (`/v1/`, `/api/`): unauthenticated, designed for trusted LAN use
- Admin endpoints (`/admin/`): secured with a Bearer token (set via the `ADMIN_TOKEN` env var)
If exposing to untrusted networks, add a reverse proxy with authentication in front of the inference endpoints.
Response storage
When enabled, every request and response is stored in the local SQLite database, including prompts and full response content. The data is compressed but not encrypted. Consider this when handling sensitive prompts.
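Stored content can be read back offline with nothing but the standard library, since the compression is plain zlib. A sketch using an in-memory database; the table and column names are illustrative (the real schema lives in `council_router/storage/db.py`, and the on-disk file defaults to `./data/responses.db`):

```python
import sqlite3
import zlib

# Illustrative schema only; substitute the real database path and columns
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE responses (id INTEGER PRIMARY KEY, content BLOB)")
con.execute(
    "INSERT INTO responses (content) VALUES (?)",
    (zlib.compress(b"Hello!"),),
)

# Fetch a stored blob and decompress it
blob, = con.execute("SELECT content FROM responses WHERE id = 1").fetchone()
print(zlib.decompress(blob).decode())  # Hello!
```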
Full monitoring stack with one command:
```bash
docker compose up -d
```

| Service | Port | Description |
|---|---|---|
| council-router | 11430 | The proxy |
| ollama | 11434 | Example Ollama backend |
| prometheus | 9090 | Metrics collection |
| grafana | 3000 | Dashboards (login: admin/admin) |
A pre-built Grafana dashboard is auto-provisioned with panels for request rate, p95 latency, tokens/sec, backend status, active requests, and fallback activations.
```bash
make install-dev   # Install dev dependencies
make test          # Run all tests
make lint          # Lint
make format        # Format code
make run           # Start the server
```

```
council-router/
├── council_router/
│   ├── __init__.py, __main__.py   # Package + CLI entry
│   ├── config.py                  # YAML config with Pydantic models
│   ├── api_types.py               # OpenAI + Ollama request/response models
│   ├── auth.py                    # Bearer token admin auth
│   ├── server.py                  # FastAPI app – all 15 endpoints
│   ├── registry.py                # Backend lifecycle + model tracking
│   ├── router.py                  # Smart routing + fallback chains
│   ├── backends/
│   │   ├── base.py                # Abstract backend interface
│   │   ├── ollama.py              # Ollama backend
│   │   ├── openai_compat.py       # Generic OpenAI-compatible backend
│   │   └── system_monitor.py      # MCP system monitor client
│   ├── ollama_compat/adapter.py   # Ollama ↔ OpenAI format translation
│   ├── storage/
│   │   ├── db.py                  # SQLite schema (WAL mode)
│   │   ├── writer.py              # Async background writer
│   │   ├── compression.py         # zlib helpers
│   │   └── models.py              # Storage Pydantic models
│   ├── metrics/collector.py       # 11 Prometheus metrics
│   └── council/
│       ├── templates.py           # 3 built-in + YAML loader
│       └── engine.py              # Multi-round deliberation engine
├── tests/
│   ├── conftest.py                # MockBackend + shared fixtures
│   ├── unit/                      # 113 unit tests
│   └── component/                 # 14 component tests
├── monitoring/                    # Prometheus + Grafana configs
├── Dockerfile, docker-compose.yml
├── config.yaml, pyproject.toml
└── README.md
```
Test suite: 127 tests, all passing in under 3 seconds.
```jsonc
{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,       // Set true to enable SSE streaming
  "temperature": 0.7     // Optional
}
```