Unified LLM proxy that aggregates multiple servers behind a single OpenAI-compatible API

Smart routing · Model fallbacks · LLM Council · Response storage · Prometheus metrics
| Feature | Description |
|---|---|
| Multi-backend proxy | Aggregate Ollama, OpenAI API, Anthropic, Gemini, or any of the 100+ providers supported via LiteLLM |
| Smart routing | Priority-based, loaded-model preference, or round-robin, with automatic fallback chains |
| Dual API | OpenAI-compatible (/v1/) and Ollama-native (/api/) endpoints side by side |
| LLM Council | Multi-model deliberation: multiple personas analyze your question, then a synthesizer produces one answer |
| Response storage | Every token saved to compressed SQLite for offline analysis, including thinking tokens |
| Prometheus metrics | Request rates, latency histograms, token counters, backend health, tokens/sec |
| Docker ready | Full Compose stack with Prometheus + Grafana, pre-built dashboard included |
| Open WebUI compatible | Drop-in replacement: point Open WebUI at port 11430 and go |
> **Tip**
> The smart router prefers backends with the model already loaded in VRAM, so repeated queries to the same model skip cold-load latency entirely.
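The preference described above can be illustrated with a simplified sketch. Note that `Backend` and `choose_backend` here are illustrative stand-ins, not the project's internal API:

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    priority: int  # lower value = higher priority
    loaded_models: set[str] = field(default_factory=set)


def choose_backend(model: str, backends: list[Backend]) -> Backend:
    """Prefer backends that already have the model in VRAM; break ties by priority."""
    return min(
        backends,
        key=lambda b: (model not in b.loaded_models, b.priority),
    )


local = Backend("local-ollama", priority=1)
gpu = Backend("gpu-server", priority=2, loaded_models={"llama3.1:8b"})

# gpu-server wins despite lower priority: the model is already loaded
print(choose_backend("llama3.1:8b", [local, gpu]).name)
```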
- Python 3.12+
- At least one LLM backend (Ollama, OpenAI API, Anthropic, Gemini, or any LiteLLM-supported provider)
```bash
# Install
pip install -e .

# Or with dev dependencies
pip install -e ".[dev]"

# Edit config.yaml with your backends
vim config.yaml
```

```yaml
backends:
  - name: "local-ollama"
    type: ollama
    url: "http://localhost:11434"
    priority: 1

  - name: "gpu-server"
    type: ollama
    url: "http://192.168.1.100:11434"
    priority: 2
```

```bash
# Default: http://0.0.0.0:11430
python -m council_router
```

```bash
# List all models from all backends
curl http://localhost:11430/v1/models
```
```bash
# Chat completion
curl -X POST http://localhost:11430/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Streaming
curl -N -X POST http://localhost:11430/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Write a poem"}],
    "stream": true
  }'
```

```python
from openai import OpenAI

# Point at council-router instead of Ollama directly
client = OpenAI(base_url="http://localhost:11430/v1", api_key="unused")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```bash
# Works with any Ollama client – just change the port
curl http://localhost:11430/api/tags

curl -X POST http://localhost:11430/api/chat \
  -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hi"}]}'
```

| Method | Path | Description |
|---|---|---|
| GET | `/` | Server info |
| GET | `/health` | Health check with per-backend status |
| GET | `/metrics` | Prometheus metrics exposition |
| GET | `/v1/models` | List all models across all backends |
| GET | `/v1/models/{id}` | Get model details + which backends serve it |
| POST | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| POST | `/v1/council/chat` | Council deliberation – synthesized multi-perspective response |
| GET | `/api/tags` | Ollama-native model listing |
| POST | `/api/chat` | Ollama-native chat (NDJSON streaming) |
| POST | `/api/generate` | Ollama-native generate |
| GET | `/admin/backends` | Backend management (auth required) |
| GET | `/admin/responses` | Query stored responses (paginated, filterable) |
| GET | `/admin/responses/stats` | Storage statistics |
| GET | `/admin/responses/{id}` | Full response with decompressed content |
| GET | `/admin/council/templates` | List available council templates |
If the requested model isn't available, the proxy checks the fallback rules and transparently substitutes an alternative. The `X-Fallback-Model` response header tells you which model actually served the request.
The council feature enables multi-perspective deliberation for higher quality responses. Multiple personas analyze your question in parallel, then a synthesizer produces one comprehensive answer.
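The flow reduces to "fan out to personas, then synthesize". A simplified sketch, with persona names and the `deliberate`/`ask` helpers as illustrative assumptions rather than the project's engine code:

```python
def deliberate(question: str, personas: list[str], ask) -> str:
    """Collect each persona's answer, then have a synthesizer merge them.

    `ask(persona, prompt)` stands in for a chat-completion call to a backend.
    """
    opinions = [f"{p}: {ask(p, question)}" for p in personas]
    return ask("synthesizer", question + "\n\nPerspectives:\n" + "\n".join(opinions))


# Stub backend so the sketch runs without a server
answer = deliberate(
    "Should we adopt microservices?",
    ["critic", "explorer", "engineer"],
    ask=lambda persona, prompt: f"<{persona} view>",
)
print(answer)  # <synthesizer view>
```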
```bash
curl -X POST http://localhost:11430/v1/council/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Should we adopt microservices?"}]
  }'
```

| Template | Participants | Rounds | Description |
|---|---|---|---|
| `diverse-analysis` | 3 | 1 | Critical thinker + creative explorer + practical engineer |
| `code-review` | 3 | 1 | Security + performance + architecture reviewers |
| `debate` | 2 | 2 | Proponent vs. opponent, two rounds, then neutral judge |
Custom templates can be added as YAML files in the `council_templates/` directory.
> **Note**
> Single-model councils (all participants use the same model but different personas) trade response diversity for simplicity. For more varied insights, configure per-participant models in your template.
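A per-participant-model template might look roughly like this. The field names below are illustrative assumptions, not the documented schema (check the built-in templates in `council_templates/` for the exact format):

```yaml
# council_templates/pros-cons.yaml (illustrative field names)
name: pros-cons
rounds: 1
participants:
  - persona: "optimist"
    model: "llama3.1:8b"
  - persona: "skeptic"
    model: "mistral:7b"
synthesizer:
  model: "llama3.1:8b"
```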
All settings live in `config.yaml`. Environment variables can be injected with `${VAR_NAME}` syntax.
| Setting | Default | Description |
|---|---|---|
| `server.port` | `11430` | Server port |
| `routing.strategy` | `smart` | Routing strategy: `smart`, `priority`, `round-robin` |
| `routing.prefer_loaded_model` | `true` | Prefer backends with model in VRAM |
| `routing.heartbeat_interval` | `10` | Seconds between health checks |
| `response_storage.enabled` | `true` | Save all responses to SQLite |
| `response_storage.compress` | `true` | zlib-compress stored content |
| `response_storage.db_path` | `./data/responses.db` | Database location |
| `metrics.enabled` | `true` | Expose Prometheus metrics at `/metrics` |
| `admin.auth.type` | `bearer` | Admin auth: `bearer` or `none` |
| `admin.auth.token` | `${ADMIN_TOKEN}` | Bearer token for admin endpoints |
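For example, the admin token can be pulled from the environment rather than hard-coded in `config.yaml`:

```yaml
admin:
  auth:
    type: bearer
    token: "${ADMIN_TOKEN}"   # injected from the environment when the config is loaded
```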
Authentication
- Inference endpoints (`/v1/`, `/api/`): unauthenticated, designed for trusted LAN use
- Admin endpoints (`/admin/`): secured with a Bearer token (set via the `ADMIN_TOKEN` env var)
If exposing to untrusted networks, add a reverse proxy with authentication in front of the inference endpoints.
Response storage
When enabled, every request and response is stored in the local SQLite database, including prompts and full response content. The data is compressed but not encrypted. Consider this when handling sensitive prompts.
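Stored content can be read back offline with nothing but the standard library, since the compression is plain zlib. A sketch using an in-memory database; the table and column names are illustrative (the real schema lives in `council_router/storage/db.py`, and the on-disk file defaults to `./data/responses.db`):

```python
import sqlite3
import zlib

# Illustrative schema only; substitute the real database path and columns
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE responses (id INTEGER PRIMARY KEY, content BLOB)")
con.execute(
    "INSERT INTO responses (content) VALUES (?)",
    (zlib.compress(b"Hello!"),),
)

# Fetch a stored blob and decompress it
blob, = con.execute("SELECT content FROM responses WHERE id = 1").fetchone()
print(zlib.decompress(blob).decode())  # Hello!
```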
Full monitoring stack with one command:
```bash
docker compose up -d
```

| Service | Port | Description |
|---|---|---|
| council-router | 11430 | The proxy |
| ollama | 11434 | Example Ollama backend |
| prometheus | 9090 | Metrics collection |
| grafana | 3000 | Dashboards (login: admin/admin) |
A pre-built Grafana dashboard is auto-provisioned with panels for request rate, p95 latency, tokens/sec, backend status, active requests, and fallback activations.
```bash
make install-dev   # Install dev dependencies
make test          # Run all tests
make lint          # Lint
make format        # Format code
make run           # Start the server
```

```
council-router/
├── council_router/
│   ├── __init__.py, __main__.py   # Package + CLI entry
│   ├── config.py                  # YAML config with Pydantic models
│   ├── api_types.py               # OpenAI + Ollama request/response models
│   ├── auth.py                    # Bearer token admin auth
│   ├── server.py                  # FastAPI app – all 15 endpoints
│   ├── registry.py                # Backend lifecycle + model tracking
│   ├── router.py                  # Smart routing + fallback chains
│   ├── backends/
│   │   ├── base.py                # Abstract backend interface
│   │   ├── ollama.py              # Ollama backend
│   │   ├── openai_compat.py       # Generic OpenAI-compatible backend
│   │   └── system_monitor.py      # MCP system monitor client
│   ├── ollama_compat/adapter.py   # Ollama ↔ OpenAI format translation
│   ├── storage/
│   │   ├── db.py                  # SQLite schema (WAL mode)
│   │   ├── writer.py              # Async background writer
│   │   ├── compression.py         # zlib helpers
│   │   └── models.py              # Storage Pydantic models
│   ├── metrics/collector.py       # 11 Prometheus metrics
│   └── council/
│       ├── templates.py           # 3 built-in + YAML loader
│       └── engine.py              # Multi-round deliberation engine
├── tests/
│   ├── conftest.py                # MockBackend + shared fixtures
│   ├── unit/                      # 113 unit tests
│   └── component/                 # 14 component tests
├── monitoring/                    # Prometheus + Grafana configs
├── Dockerfile, docker-compose.yml
├── config.yaml, pyproject.toml
└── README.md
```
Test suite: 127 tests, all passing in under 3 seconds.
```jsonc
{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,       // Set true to enable SSE streaming
  "temperature": 0.7     // Optional
}
```