ModelWeaver

Multi-provider LLM proxy for Claude Code. Route different agent roles to different model providers with automatic fallback, racing, circuit breakers, and a native desktop GUI.

30-Second Setup

npx @kianwoon/modelweaver init    # pick your provider, paste your API key
npx @kianwoon/modelweaver         # start the proxy

# In another terminal — point Claude Code at ModelWeaver:
export ANTHROPIC_BASE_URL=http://localhost:3456
export ANTHROPIC_API_KEY=unused-but-required
claude

No config file editing. No provider SDK installs. The wizard tests your API key and generates the config automatically.

Full setup guide · All CLI commands · Configuration reference



What's New — v0.3.73

  • OpenAI-compatible upstream providers — Chat Completions (/v1/chat/completions) and Responses API (/v1/responses) formats with full Anthropic downstream compatibility (#240)
  • Smart request routing — classify message content by complexity and route to the appropriate model tier automatically (#97)
  • Single-provider hedge skip — hedging disabled for single-provider chains, prevents rate-limit amplification (#231)
  • 408/504 retry with fresh pool — request timeout and server-unavailable now retry with a new connection pool (#231)
  • Transient error detection — detect and retry on transient errors in 400/413 response bodies (#230)
  • GET /v1/models endpoint — list available models from configured providers (#229)
  • Retry-After header support — respect provider rate-limit backoff for 429/503 responses (#228)
  • Streaming-only token speed — TTFB excluded from token-per-second calculations for accurate metrics (#227)
  • Per-model connection pools — each model gets its own HTTP/2 connection for TCP isolation (#186)
  • GOAWAY-aware retry — graceful HTTP/2 drain no longer marks pool as "failed" (#188)

View all releases · Full changelog


How It Works

ModelWeaver sits between Claude Code and upstream model providers as a local HTTP proxy. It inspects the model field in each Anthropic Messages API request and routes it to the best-fit provider.

Claude Code  ──→  ModelWeaver  ──→  Anthropic (primary)      [Anthropic format]
                  (localhost)  ──→  OpenRouter (fallback)    [Anthropic format]
                               ──→  OpenAI-compatible        [OpenAI format]
                       │
                  0. Classify message content → tier override? (smartRouting)
                  1. Match exact model name (modelRouting)
                  2. Match tier via substring (tierPatterns)
                  3. Translate request/response per provider apiFormat
                  4. Fall back on 429 / 5xx errors
                  5. Race remaining providers on 429

Features

  • Smart request routing — classify request complexity by message content (regex keyword scoring) and override the model tier automatically
  • Tier-based routing — route by model family (sonnet/opus/haiku) using substring pattern matching
  • Exact model routing — route specific model names to dedicated providers (checked first)
  • Automatic fallback — transparent failover on rate limits (429) and server errors (5xx)
  • Adaptive racing — on 429, automatically races remaining providers simultaneously
  • Model name rewriting — each provider in the chain can use a different model name
  • Weighted distribution — spread traffic across providers by weight percentage
  • Circuit breaker — per-provider circuit breaker with closed/open/half-open states, prevents hammering unhealthy providers
  • Request hedging — sends multiple copies when a provider shows high latency variance (CV > 0.5), returns the fastest response (skipped for single-provider chains to avoid rate-limit amplification)
  • TTFB timeout — fails slow providers before full timeout elapses (configurable per provider)
  • Stall detection — detects stalled streams and aborts them, triggering fallback
  • Connection pooling — per-provider undici Agent dispatcher with configurable pool size
  • Per-model connection pools — isolate HTTP/2 connections per model via modelPools config for TCP-level isolation
  • Connection retry — automatic retry with exponential backoff for stale connections, TTFB timeouts, and GOAWAY drains
  • Session agent pooling — reuses HTTP/2 agents across requests within the same session for connection affinity
  • Adaptive TTFB — dynamically adjusts TTFB timeout based on observed latency history
  • GOAWAY-aware retry — graceful HTTP/2 GOAWAY drain no longer marks pool as "failed"
  • OpenAI-compatible upstream — translate Anthropic Messages API requests to OpenAI Chat Completions or Responses API format in real time, with SSE stream translation back to Anthropic format
  • Stream buffering — optional time-based and size-based SSE buffering (streamBufferMs, streamBufferBytes)
  • Health scores — per-provider health scoring based on latency and error rates
  • Provider error tracking — per-provider error counts with status code breakdown, displayed in GUI in real-time
  • Concurrent limits — cap concurrent requests per provider
  • Interactive setup wizard — guided configuration with API key validation, hedging config, and provider editing
  • Config hot-reload — changes to config file are picked up automatically, no restart needed
  • Daemon mode — background process with auto-restart, launchd integration, and reload support
  • Desktop GUI — native Tauri app with real-time progress bars, provider health, error breakdown, and recent request history

Why ModelWeaver

Single-Provider Is a Hobby Setup

Relying on one LLM provider is fine for experiments. For serious development, it's a liability. When your provider degrades — rate limits, slow tokens, stalled streams, outright outages — your entire coding session freezes. A one-hour task becomes a multi-hour wait-and-retry loop.

ModelWeaver gives you high availability for AI coding — multiple providers, automatic failover, and intelligent traffic management. When one provider goes down, you don't even notice.

What happens without ModelWeaver:

10:00  Coding session starts — everything's fast
10:30  Token generation slows from 80 tok/s to 3 tok/s
10:35  Stream stalls mid-response — you wait
10:40  Retry — 429 rate limit — you wait more
10:50  Another retry — 502 error — you give up
11:30  Start over, lost context
Result: 1-hour job took 4 hours

What happens with ModelWeaver:

10:00  Coding session starts — ModelWeaver routes to Provider A
10:30  Provider A slows down — hedging detects high latency variance
       → sends 2 copies simultaneously, returns the fastest
10:35  Provider A stalls — stall detection aborts in seconds
       → transparent fallback to Provider B, stream continues
10:40  Provider A rate limits (429) — remaining providers race simultaneously
       → recovery in <2s, no context lost
Result: 1-hour job took 1 hour

Failure Scenarios and Recovery

| Problem | What ModelWeaver Does | Recovery Time |
|---|---|---|
| Provider slows down | Hedging sends 2-4 copies, returns fastest | Instant |
| Stream stalls mid-response | Stall detection aborts, falls back to next provider | Seconds |
| 429 rate limit | Races all remaining providers simultaneously | <2s |
| Provider goes down | Circuit breaker opens, traffic rerouted automatically | Seconds |
| All providers unhealthy | Global backoff returns 503 immediately | Immediate |
| Stale HTTP/2 connection | Transparent retry with exponential backoff | Transparent |
| Provider returns errors | Health-score reordering deprioritizes bad providers | Rolling 5-min window |

Cost Optimization Without Quality Loss

Running everything through one provider at premium rates gets expensive. A full Claude Code session with multiple subagents generates 50-100+ API calls.

  • Weighted routing with health blending: Traffic automatically shifts toward healthier and cheaper providers when one degrades
  • Tier-based routing: Haiku-tier Explore agents (cheap, fast) never accidentally hit Opus-tier pricing. Sonnet coding agents don't burn expensive Opus tokens
  • Model rewriting per provider: The same claude-sonnet-4-6 model name can route to different models on different providers — zero config changes in Claude Code

Operational Visibility

When coding through a proxy, you're normally blind to why responses are slow or failing.

  • Desktop GUI: Real-time progress bars showing which provider handled each request, response time, and whether hedging fired
  • Health scores API: curl /api/health-scores shows per-provider scores (0-1). A score of 0.3 means the provider is failing ~50% of requests
  • Error breakdown: Per-provider error counts with status code breakdown — spot patterns like a provider returning 502s consistently
  • Circuit breaker state: See which providers are open/closed/half-open in real-time

Zero-Downtime Configuration

  • Hot-reload (300ms debounce): Edit config.yaml and the daemon picks up changes automatically. No restart, no killed in-flight requests
  • SIGHUP reload: After rebuilding from source, modelweaver reload restarts the worker without killing the monitor

Prerequisites

  • Node.js 20 or later — Install Node.js
  • npx — included with Node.js (no separate install needed)

Installation

ModelWeaver requires no permanent install — npx downloads and runs it on the fly. But if you prefer a global install:

npm install -g @kianwoon/modelweaver

After that, replace npx @kianwoon/modelweaver with modelweaver (or the shorter mw) in all commands below.

Quick Start

1. Run the setup wizard

npx @kianwoon/modelweaver init

The wizard guides you through:

  • Selecting from 6 preset providers (Anthropic, OpenRouter, Together AI, GLM/Z.ai, Minimax, Fireworks)
  • Testing API keys to verify connectivity
  • Setting up model routing tiers and hedging config
  • Creating ~/.modelweaver/config.yaml and ~/.modelweaver/.env

2. Start ModelWeaver

# Foreground (see logs in terminal)
npx @kianwoon/modelweaver

# Background daemon (auto-restarts on crash)
npx @kianwoon/modelweaver start

# Install as launchd service (auto-start at login)
npx @kianwoon/modelweaver install

3. Point Claude Code to ModelWeaver

export ANTHROPIC_BASE_URL=http://localhost:3456
export ANTHROPIC_API_KEY=unused-but-required
claude

CLI Commands

npx @kianwoon/modelweaver init              # Interactive setup wizard
npx @kianwoon/modelweaver start             # Start as background daemon
npx @kianwoon/modelweaver stop              # Stop background daemon
npx @kianwoon/modelweaver status            # Show daemon status + service state
npx @kianwoon/modelweaver remove            # Stop daemon + remove PID and log files
npx @kianwoon/modelweaver reload            # Reload daemon worker (after rebuild)
npx @kianwoon/modelweaver install           # Install launchd service (auto-start at login)
npx @kianwoon/modelweaver uninstall         # Uninstall launchd service
npx @kianwoon/modelweaver gui               # Launch desktop GUI (auto-downloads binary)
npx @kianwoon/modelweaver [options]         # Run in foreground

CLI Options

  -p, --port <number>      Server port                    (default: from config)
  -c, --config <path>      Config file path               (auto-detected)
  -v, --verbose            Enable debug logging           (default: off)
  -h, --help               Show help

Init Options

  --global                 Edit global config only
  --path <file>            Write config to a specific file

Daemon Mode

Run ModelWeaver as a background process that survives terminal closure and auto-recovers from crashes.

npx @kianwoon/modelweaver start             # Start (forks monitor + daemon)
npx @kianwoon/modelweaver status            # Check if running
npx @kianwoon/modelweaver reload            # Reload worker after rebuild
npx @kianwoon/modelweaver stop              # Graceful stop (SIGTERM → SIGKILL after 5s)
npx @kianwoon/modelweaver remove            # Stop + remove PID file + log file
npx @kianwoon/modelweaver install           # Install launchd service
npx @kianwoon/modelweaver uninstall         # Uninstall launchd service

How it works: start forks a lightweight monitor process that owns the PID file. The monitor spawns the actual daemon worker. If the worker crashes, the monitor auto-restarts it with exponential backoff starting at 500ms (up to 10 attempts). After 60 seconds of stable running, the restart counter resets.

modelweaver.pid        → Monitor process (handles signals, watches child)
  └── modelweaver.worker.pid → Daemon worker (runs HTTP server)

Files:

  • ~/.modelweaver/modelweaver.pid — monitor PID
  • ~/.modelweaver/modelweaver.worker.pid — worker PID
  • ~/.modelweaver/modelweaver.log — daemon output log

Desktop GUI

ModelWeaver ships a native desktop GUI built with Tauri. No Rust toolchain needed — the binary is auto-downloaded from GitHub Releases.

npx @kianwoon/modelweaver gui

First run downloads the latest binary for your platform (~10-30 MB). Subsequent launches use the cached version.

GUI features:

  • Real-time progress bars with provider name and model info
  • Provider health cards with error counts and status code breakdown
  • Recent request history sorted by timestamp
  • Config validation error banner
  • Auto-reconnect on daemon restart

Supported platforms:

| Platform | Format |
|---|---|
| macOS (Apple Silicon) | .dmg |
| macOS (Intel) | .dmg |
| Linux (x86_64) | .AppImage |
| Windows (x86_64) | .msi |

Cached files are stored in ~/.modelweaver/gui/ with version tracking — new versions download automatically on the next gui launch.

Configuration

Config file locations

Checked in order (first found wins):

  1. ./modelweaver.yaml (project-local)
  2. ~/.modelweaver/config.yaml (user-global)

Full config schema

server:
  port: 3456                  # Server port          (default: 3456)
  host: localhost             # Bind address         (default: localhost)
  streamBufferMs: 0           # Time-based stream flush threshold  (default: disabled)
  streamBufferBytes: 0        # Size-based stream flush threshold  (default: disabled)
  globalBackoffEnabled: true  # Global backoff on repeated failures (default: true)
  unhealthyThreshold: 0.5     # Health score below which provider is unhealthy (default: 0.5, 0–1)
  maxBodySizeMB: 10           # Max request body size in MB        (default: 10, 1–100)
  sessionIdleTtlMs: 600000    # Session agent pool idle TTL in ms  (default: 600000 / 10min, min: 60000)
  disableThinking: false      # Strip thinking blocks from requests (default: false)

# Adaptive request hedging
hedging:
  speculativeDelay: 500       # ms before starting backup providers  (default: 500)
  cvThreshold: 0.5            # latency CV threshold for hedging    (default: 0.5)
  maxHedge: 4                 # max concurrent copies per request    (default: 4)

providers:
  anthropic:
    baseUrl: https://api.anthropic.com
    apiKey: ${ANTHROPIC_API_KEY}  # Env var substitution
    timeout: 20000                # Request timeout in ms  (default: 20000)
    ttfbTimeout: 8000             # TTFB timeout in ms     (default: 8000)
    stallTimeout: 15000           # Stall detection timeout (default: 15000)
    poolSize: 10                  # Connection pool size   (default: 10)
    concurrentLimit: 10           # Max concurrent requests (default: unlimited)
    connectionRetries: 3          # Retries for stale connections (default: 3, max: 10)
    staleAgentThresholdMs: 30000  # Mark pooled agent stale after idle ms (optional)
    rateLimitBackoffMs: 2000      # Backoff after 429/503 in ms (optional, overrides Retry-After)
    retryableErrorPatterns:       # Substring patterns for retryable error messages (optional, case-insensitive)
      - "network error"
      - "system error"
    modelPools:                   # Per-model pool size overrides (optional)
      "claude-sonnet-4-20250514": 20
    modelLimits:                  # Per-provider token limits (optional)
      maxOutputTokens: 16384
    authType: anthropic           # "anthropic" | "bearer"  (default: anthropic)
    apiFormat: anthropic          # "anthropic" | "openai-chat" | "openai-responses"  (default: anthropic)
    circuitBreaker:               # Per-provider circuit breaker (optional)
      failureThreshold: 3         # Failures before opening circuit (alias: threshold, default: 3)
      windowSeconds: 60           # Time window for failure count  (default: 60)
      cooldownSeconds: 30         # Cooldown before retrying the provider (alias: cooldown, default: 30)
      rateLimitCooldownSeconds: 10  # Shorter cooldown for 429 rate limits (optional)
  openrouter:
    baseUrl: https://openrouter.ai/api
    apiKey: ${OPENROUTER_API_KEY}
    authType: bearer
    timeout: 60000
  # OpenAI-compatible upstream (translates Anthropic → OpenAI format)
  openai-compatible:
    baseUrl: https://api.openai.com
    apiKey: ${OPENAI_API_KEY}
    authType: bearer
    apiFormat: openai-chat           # or "openai-responses" for Responses API
    timeout: 30000

# Exact model name routing (checked FIRST, before tier patterns)
modelRouting:
  "glm-5-turbo":
    - provider: anthropic
  "MiniMax-M2.7":
    - provider: openrouter
      model: minimax/MiniMax-M2.7        # With model name rewrite
  # Weighted distribution example:
  # "claude-sonnet-4":
  #   - provider: anthropic
  #     weight: 70
  #   - provider: openrouter
  #     weight: 30

# Tier-based routing (fallback chain)
routing:
  sonnet:
    - provider: anthropic
      model: claude-sonnet-4-20250514      # Optional: rewrite model name
    - provider: openrouter
      model: anthropic/claude-sonnet-4      # Fallback
  opus:
    - provider: anthropic
      model: claude-opus-4-20250514
  haiku:
    - provider: anthropic
      model: claude-haiku-4-5-20251001

# Pattern matching: a model name containing any listed substring is matched to that tier
tierPatterns:
  sonnet: ["sonnet", "3-5-sonnet", "3.5-sonnet"]
  opus: ["opus", "3-opus", "3.5-opus"]
  haiku: ["haiku", "3-haiku", "3.5-haiku"]

# Smart request routing — classify message content and override model tier
# When enabled, analyzes the last user message against regex patterns.
# If cumulative score >= escalationThreshold, routes to the classified tier
# instead of the model requested. Disabled by default.
# smartRouting:
#   enabled: true
#   escalationThreshold: 2    # minimum score to trigger tier override
#   patterns:
#     "1":                     # Tier 1 — best model (e.g., opus-tier)
#       - pattern: "architect|design system|from scratch"
#         score: 3
#       - pattern: "debug|troubleshoot|investigate|root cause"
#         score: 2
#     "2":                     # Tier 2 — good model (e.g., sonnet-tier)
#       - pattern: "explain|summarize|compare"
#         score: 2
#       - pattern: "write.*test|refactor|review"
#         score: 2
# Requires matching routing entries: routing.tier1, routing.tier2 (tier3 optional)
# Graceful degradation: if classified tier has no providers, tries next tier down

Routing priority

  1. Smart content routing (smartRouting) — if enabled and message content matches classification patterns, override to the classified tier (bypasses all other routing)
  2. Exact model name (modelRouting) — if the request model matches exactly, use that route
  3. Weighted distribution — if the model has weight entries, requests are distributed across providers proportionally
  4. Tier pattern (tierPatterns + routing) — substring match the model name against patterns, then use the tier's provider chain
  5. No match — returns 502 with a descriptive error listing configured tiers and model routes

Provider chain behavior

  • First provider is primary, rest are fallbacks
  • Fallback triggers on: 429 (rate limit), 5xx (server error), network timeout, stream stall
  • Adaptive race mode — when a 429 is received, remaining providers are raced simultaneously (not sequentially) for faster recovery
  • Circuit breaker — providers that repeatedly fail are temporarily skipped (auto-recovers after cooldown, configurable window)
  • Hedging skip — single-provider chains skip hedging entirely (multi-copy to one provider amplifies rate limits without improving outcome)
  • No fallback on: 4xx (bad request, auth failure, forbidden) — returned immediately (except 429 and transient errors in 400/413 bodies)
  • Model rewriting: each provider entry can override the model field in the request body

OpenAI-Compatible Providers

ModelWeaver can proxy to any OpenAI-compatible API while keeping the downstream (Claude Code) connection in Anthropic format. Set apiFormat on the provider to control the upstream wire format:

| apiFormat | Upstream Endpoint | SSE Translation |
|---|---|---|
| anthropic (default) | /v1/messages | Passthrough |
| openai-chat | /v1/chat/completions | OpenAI chunks → Anthropic SSE events |
| openai-responses | /v1/responses | Responses API events → Anthropic SSE events |

What gets translated:

  • Request body: system → instructions, messages → input/chat format, tool_use → tool_calls, tool_result → role:tool, thinking → reasoning, max_tokens → max_output_tokens
  • Headers: x-api-key → Authorization: Bearer, strips anthropic-version/anthropic-beta
  • SSE stream: OpenAI choices[].delta chunks → Anthropic content_block_start/delta/stop events with proper message lifecycle (message_start → content blocks → message_delta → message_stop)
  • Errors: OpenAI error JSON → Anthropic {type:"error", error:{type,message}} format

Supported features: text, tool calls, streaming, thinking/reasoning, vision (image blocks), system messages, cache control (stripped).

Example — OpenAI provider:

providers:
  my-openai:
    baseUrl: https://api.openai.com
    apiKey: ${OPENAI_API_KEY}
    authType: bearer
    apiFormat: openai-chat

routing:
  sonnet:
    - provider: my-openai
      model: gpt-4o

Example — Responses API provider:

providers:
  my-responses:
    baseUrl: https://api.example.com
    apiKey: ${EXAMPLE_API_KEY}
    authType: bearer
    apiFormat: openai-responses

Supported providers

| Provider | Auth Type | Base URL |
|---|---|---|
| Anthropic | x-api-key | https://api.anthropic.com |
| OpenRouter | Bearer | https://openrouter.ai/api |
| Together AI | Bearer | https://api.together.xyz |
| GLM (Z.ai) | x-api-key | https://api.z.ai/api/anthropic |
| Minimax | x-api-key | https://api.minimax.io/anthropic |
| Fireworks | Bearer | https://api.fireworks.ai/inference/v1 |

Any OpenAI/Anthropic-compatible API works — set baseUrl, authType, and apiFormat appropriately. For OpenAI-compatible endpoints, set apiFormat: openai-chat or apiFormat: openai-responses.

Config hot-reload

In daemon mode, ModelWeaver watches the config file for changes and reloads automatically (debounced 300ms). You can also send a manual reload signal:

kill -SIGHUP $(cat ~/.modelweaver/modelweaver.pid)

Or use the CLI:

npx @kianwoon/modelweaver reload

Re-running npx @kianwoon/modelweaver init also signals the running daemon to reload.

API

Health check

curl http://localhost:3456/api/status

Returns circuit breaker state for all providers and server uptime.

Version

curl http://localhost:3456/api/version

Returns the running ModelWeaver version.

Available models

curl http://localhost:3456/v1/models

Returns the list of available models from configured providers (Anthropic-compatible format).

Connection pool status

curl http://localhost:3456/api/pool

Returns active connection pool state for all providers.

Health scores

curl http://localhost:3456/api/health-scores

Returns per-provider health scores based on latency and error rates.

Session pool status

curl http://localhost:3456/api/sessions

Returns session agent pool statistics.

Observability

# Aggregated request metrics (by model, provider, error type)
curl http://localhost:3456/api/metrics/summary

# Per-provider circuit breaker state
curl http://localhost:3456/api/circuit-breaker

# Hedging win/loss statistics
curl http://localhost:3456/api/hedging/stats

How Claude Code Uses Model Tiers

Claude Code sends different model names for different agent roles:

| Agent Role | Model Tier | Typical Model Name |
|---|---|---|
| Main conversation, coding | Sonnet | claude-sonnet-4-20250514 |
| Explore (codebase search) | Haiku | claude-haiku-4-5-20251001 |
| Plan (analysis) | Sonnet | claude-sonnet-4-20250514 |
| Complex subagents | Opus | claude-opus-4-20250514 |
| GLM/Z.ai models | Exact routing | glm-5-turbo |
| MiniMax models | Exact routing | MiniMax-M2.7 |

ModelWeaver uses the model name to determine which agent tier is calling, then routes accordingly.

Development

npm install          # Install dependencies
npm test             # Run tests (307 tests)
npm run build        # Build for production (tsup)
npm run dev          # Run in dev mode (tsx)

FAQ

Why do I need ANTHROPIC_API_KEY=unused-but-required?

Claude Code validates that ANTHROPIC_API_KEY is set before connecting. ModelWeaver handles real auth to upstream providers — the env var just satisfies Claude Code's startup check.

Port 3456 is already in use.

Something else is running on that port. Either stop it, or set a different port in your config:

server:
  port: 8080

Then update ANTHROPIC_BASE_URL to match.

How do I know ModelWeaver is running?

curl http://localhost:3456/api/status

Returns JSON with uptime and circuit breaker state. Or check the GUI:

npx @kianwoon/modelweaver gui

How do I switch providers?

Run npx @kianwoon/modelweaver init again — it opens your existing config for editing. Or edit ~/.modelweaver/config.yaml directly (hot-reloaded automatically in daemon mode).

Can I use OpenAI-compatible providers with Claude Code?

Yes. Set apiFormat: openai-chat (for /v1/chat/completions) or apiFormat: openai-responses on the provider. ModelWeaver translates Anthropic Messages API requests to OpenAI format upstream, then translates responses back to Anthropic format for Claude Code. Features like tools, thinking, vision, and system messages are fully supported.

providers:
  my-provider:
    baseUrl: https://api.example.com
    apiKey: ${EXAMPLE_API_KEY}
    authType: bearer
    apiFormat: openai-chat

License

Apache-2.0
