Council Router

Unified LLM proxy — aggregates multiple servers behind a single OpenAI-compatible API
Smart routing · Model fallbacks · LLM Council · Response storage · Prometheus metrics

Python 3.12+ · FastAPI · MIT License


✨ Features

| Feature | Description |
| --- | --- |
| Multi-backend proxy | Aggregate Ollama, OpenAI API, Anthropic, Gemini, or any of the 100+ providers supported via LiteLLM |
| Smart routing | Priority-based, loaded-model preference, round-robin — with automatic fallback chains |
| Dual API | OpenAI-compatible (/v1/) and Ollama-native (/api/) endpoints side by side |
| LLM Council | Multi-model deliberation — multiple personas analyze your question, then a synthesizer produces one answer |
| Response storage | Every token saved to compressed SQLite for offline analysis — including thinking tokens |
| Prometheus metrics | Request rates, latency histograms, token counters, backend health, tokens/sec |
| Docker ready | Full Compose stack with Prometheus + Grafana, pre-built dashboard included |
| Open WebUI compatible | Drop-in replacement — point Open WebUI at port 11430 and go |

Tip

The smart router prefers backends with the model already loaded in VRAM, so repeated queries to the same model skip cold-load latency entirely.


🚀 Quick Start

Prerequisites

  • Python 3.12+
  • At least one LLM backend (Ollama, OpenAI API, Anthropic, Gemini, or any LiteLLM-supported provider)

Installation

# Install
pip install -e .

# Or with dev dependencies
pip install -e ".[dev]"

Configure

# Edit config.yaml with your backends
vim config.yaml

Example config.yaml:

backends:
  - name: "local-ollama"
    type: ollama
    url: "http://localhost:11434"
    priority: 1

  - name: "gpu-server"
    type: ollama
    url: "http://192.168.1.100:11434"
    priority: 2

Start the Server

# Default: http://0.0.0.0:11430
python -m council_router

Make a Request

# List all models from all backends
curl http://localhost:11430/v1/models

# Chat completion
curl -X POST http://localhost:11430/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Streaming
curl -N -X POST http://localhost:11430/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "stream": true
  }'

🔄 Drop-in Replacement

OpenAI Python SDK

from openai import OpenAI

# Point at council-router instead of Ollama directly
client = OpenAI(base_url="http://localhost:11430/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Ollama-Native Clients

# Works with any Ollama client — just change the port
curl http://localhost:11430/api/tags
curl -X POST http://localhost:11430/api/chat \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hi"}]}'

📡 API Reference

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Server info |
| GET | /health | Health check with per-backend status |
| GET | /metrics | Prometheus metrics exposition |
| GET | /v1/models | List all models across all backends |
| GET | /v1/models/{id} | Get model details + which backends serve it |
| POST | /v1/chat/completions | Chat completion (streaming + non-streaming) |
| POST | /v1/council/chat | Council deliberation — synthesized multi-perspective response |
| GET | /api/tags | Ollama-native model listing |
| POST | /api/chat | Ollama-native chat (NDJSON streaming) |
| POST | /api/generate | Ollama-native generate |
| GET | /admin/backends | Backend management (auth required) |
| GET | /admin/responses | Query stored responses (paginated, filterable) |
| GET | /admin/responses/stats | Storage statistics |
| GET | /admin/responses/{id} | Full response with decompressed content |
| GET | /admin/council/templates | List available council templates |

Chat Completion Request

{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,            // Enable SSE streaming
  "temperature": 0.7          // Optional
}

If the requested model isn't available, the proxy checks the fallback rules and transparently substitutes an alternative. The X-Fallback-Model response header tells you which model actually served the request.
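The README doesn't show the fallback-rule schema, so the snippet below is a hypothetical sketch of how a fallback chain might be declared in config.yaml; the fallbacks key and its fields are assumptions, not documented syntax:

```yaml
# Hypothetical sketch only: the "fallbacks" key and field names are
# assumptions. Check the shipped config.yaml for the actual schema.
fallbacks:
  - model: "llama3.1:70b"               # when this model is unavailable...
    use: ["llama3.1:8b", "qwen2.5:7b"]  # ...substitute these, in order
```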


👥 LLM Council

The council feature enables multi-perspective deliberation for higher quality responses. Multiple personas analyze your question in parallel, then a synthesizer produces one comprehensive answer.

curl -X POST http://localhost:11430/v1/council/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Should we adopt microservices?"}]
  }'

Built-in Templates

| Template | Participants | Rounds | Description |
| --- | --- | --- | --- |
| diverse-analysis | 3 | 1 | Critical thinker + creative explorer + practical engineer |
| code-review | 3 | 1 | Security + performance + architecture reviewers |
| debate | 2 | 2 | Proponent vs. opponent, two rounds, then neutral judge |

Custom templates can be added as YAML files in the council_templates/ directory.
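A custom template might look like the sketch below. The field names (participants, persona, synthesizer, and so on) are illustrative assumptions; compare against the built-in files in council_templates/ for the real schema:

```yaml
# Hypothetical template sketch; field names are assumptions, not the real schema.
name: "ops-review"
rounds: 1
participants:
  - name: "reliability"
    model: "llama3.1:8b"
    persona: "You are an SRE focused on failure modes and operational cost."
  - name: "security"
    model: "qwen2.5:7b"
    persona: "You are a security engineer focused on attack surface."
synthesizer:
  model: "llama3.1:8b"
```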

Note

Single-model councils (all participants use the same model but different personas) trade response diversity for simplicity. For more varied insights, configure per-participant models in your template.


βš™οΈ Configuration

All settings live in config.yaml. Environment variables can be injected with ${VAR_NAME} syntax.

| Setting | Default | Description |
| --- | --- | --- |
| server.port | 11430 | Server port |
| routing.strategy | smart | Routing strategy: smart, priority, round-robin |
| routing.prefer_loaded_model | true | Prefer backends with the model already in VRAM |
| routing.heartbeat_interval | 10 | Seconds between health checks |
| response_storage.enabled | true | Save all responses to SQLite |
| response_storage.compress | true | zlib-compress stored content |
| response_storage.db_path | ./data/responses.db | Database location |
| metrics.enabled | true | Expose Prometheus metrics at /metrics |
| admin.auth.type | bearer | Admin auth: bearer or none |
| admin.auth.token | ${ADMIN_TOKEN} | Bearer token for admin endpoints |
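For example, the admin token can be kept out of the file and pulled from the environment at load time:

```yaml
admin:
  auth:
    type: bearer
    token: "${ADMIN_TOKEN}"   # expanded from the environment when the config is loaded
```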

πŸ›‘οΈ Security Considerations

Authentication
  • Inference endpoints (/v1/, /api/) β€” unauthenticated, designed for trusted LAN use
  • Admin endpoints (/admin/) β€” secured with Bearer token (set via ADMIN_TOKEN env var)

If exposing to untrusted networks, add a reverse proxy with authentication in front of the inference endpoints.
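One way to do this is basic auth at the reverse proxy. The nginx sketch below is a minimal illustration, not a shipped config; the hostname, certificate setup, and htpasswd path are assumptions:

```nginx
# Minimal sketch: basic auth in front of the inference endpoints.
server {
    listen 443 ssl;
    server_name llm.example.internal;

    location / {
        auth_basic           "council-router";
        auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass           http://127.0.0.1:11430;
        proxy_buffering      off;   # keep SSE/NDJSON streaming responsive
        proxy_set_header     Host $host;
    }
}
```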

Response storage

When enabled, every request and response is stored in the local SQLite database, including prompts and full response content. The data is compressed but not encrypted. Consider this when handling sensitive prompts.
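Since storage is plain zlib over SQLite, stored content can be inspected offline. The table and column names below are guesses for illustration (the actual schema lives in storage/db.py); only the zlib round-trip itself is guaranteed:

```python
import sqlite3
import zlib

def decompress_content(blob: bytes) -> str:
    """Inflate a zlib-compressed response body back to text."""
    return zlib.decompress(blob).decode("utf-8")

# Round-trip demonstration (no database needed):
original = "The answer, after much deliberation, is 42."
stored = zlib.compress(original.encode("utf-8"))
assert decompress_content(stored) == original

# Hypothetical query against ./data/responses.db; the "responses" table
# and "content" column are assumptions, not the actual schema.
def dump_responses(db_path: str = "./data/responses.db") -> list[str]:
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT content FROM responses").fetchall()
    return [decompress_content(r[0]) for r in rows]
```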


🐳 Docker Compose

Full monitoring stack with one command:

docker compose up -d

| Service | Port | Description |
| --- | --- | --- |
| council-router | 11430 | The proxy |
| ollama | 11434 | Example Ollama backend |
| prometheus | 9090 | Metrics collection |
| grafana | 3000 | Dashboards (login: admin/admin) |

A pre-built Grafana dashboard is auto-provisioned with panels for request rate, p95 latency, tokens/sec, backend status, active requests, and fallback activations.
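If you want to query the same data by hand, panels like these typically boil down to short PromQL expressions. The metric names below are guesses for illustration; check /metrics for the names this proxy actually exports:

```promql
# Hypothetical metric names; check /metrics for the real ones.
rate(council_router_requests_total[5m])                                    # request rate
histogram_quantile(0.95, rate(council_router_request_seconds_bucket[5m])) # p95 latency
```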


🧪 Development

make install-dev    # Install dev dependencies
make test           # Run all tests
make lint           # Lint
make format         # Format code
make run            # Start the server

Project Structure

council-router/
├── council_router/
│   ├── __init__.py, __main__.py       # Package + CLI entry
│   ├── config.py                      # YAML config with Pydantic models
│   ├── api_types.py                   # OpenAI + Ollama request/response models
│   ├── auth.py                        # Bearer token admin auth
│   ├── server.py                      # FastAPI app — all 15 endpoints
│   ├── registry.py                    # Backend lifecycle + model tracking
│   ├── router.py                      # Smart routing + fallback chains
│   ├── backends/
│   │   ├── base.py                    # Abstract backend interface
│   │   ├── ollama.py                  # Ollama backend
│   │   ├── openai_compat.py           # Generic OpenAI-compatible backend
│   │   └── system_monitor.py          # MCP system monitor client
│   ├── ollama_compat/adapter.py       # Ollama ↔ OpenAI format translation
│   ├── storage/
│   │   ├── db.py                      # SQLite schema (WAL mode)
│   │   ├── writer.py                  # Async background writer
│   │   ├── compression.py             # zlib helpers
│   │   └── models.py                  # Storage Pydantic models
│   ├── metrics/collector.py           # 11 Prometheus metrics
│   └── council/
│       ├── templates.py               # 3 built-in + YAML loader
│       └── engine.py                  # Multi-round deliberation engine
├── tests/
│   ├── conftest.py                    # MockBackend + shared fixtures
│   ├── unit/                          # 113 unit tests
│   └── component/                     # 14 component tests
├── monitoring/                        # Prometheus + Grafana configs
├── Dockerfile, docker-compose.yml
├── config.yaml, pyproject.toml
└── README.md

Test suite: 127 tests — all passing in under 3 seconds.


📄 License

MIT
