Goal: Find the exact task-complexity ceiling where a local Qwen3-MoE 30.5B running in Ollama, driven by the claw-code Rust harness, sustains autonomous multi-step tool use — and where it collapses.
This is not a vibe check. It is a reproducible, publishable boundary test: local sovereign compute vs. cloud API agents, in 2026, on real agentic workloads.
Motivation: If a cloud AI blackout occurs — data center loss, geopolitical disruption, infrastructure attack — you need a local fallback that actually works. This study tells you which model to trust and why.
Verdict: use open-claw (Qwen3-MoE 30.5B) as your primary local fallback.
| Test | Weight | gemma4 (8B) | open-claw (30.5B MoE) |
|---|---|---|---|
| Basic response | 1 | PASS 2516ms | PASS 2359ms |
| Tool call emission | 3 | PASS 7922ms | FAIL 6735ms |
| Strict JSON output | 2 | PASS 5047ms | PASS 2890ms |
| Multi-step planning | 2 | FAIL 36203ms | PASS 6860ms |
| Tool input valid JSON | 3 | PASS 6656ms | PASS 6718ms |
| Context retention | 2 | PASS 3641ms | PASS 3204ms |
| Graceful ambiguity handling | 1 | PASS 23406ms | PASS 4671ms |
| Code generation | 2 | PASS 5437ms | PASS 2641ms |
| Multiple tool calls | 2 | PASS 10391ms | PASS 7906ms |
| Latency burst (3 calls) | 1 | PASS 2417ms | PASS 2260ms |
| Weighted score | | 89.5% | 84.2% |
| Median latency | | 6047ms | 3938ms |
| P95 latency | | 36203ms | 7906ms |
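The weighted scores above follow directly from the per-test weights: each model fails exactly one test, and the score is the weight of the passing tests over the total weight (19). A minimal sketch of that arithmetic (assumed scoring logic, not the actual `compare.py` implementation):

```python
# Weights from the results table; total weight = 19.
WEIGHTS = {
    "basic_response": 1,
    "tool_call_emission": 3,
    "strict_json": 2,
    "multi_step_planning": 2,
    "tool_input_json": 3,
    "context_retention": 2,
    "ambiguity_handling": 1,
    "code_generation": 2,
    "multiple_tool_calls": 2,
    "latency_burst": 1,
}

def weighted_score(failed: set) -> float:
    """Percentage of total weight earned by passing tests."""
    total = sum(WEIGHTS.values())
    passed = sum(w for t, w in WEIGHTS.items() if t not in failed)
    return round(100 * passed / total, 1)

print(weighted_score({"multi_step_planning"}))  # gemma4:    89.5
print(weighted_score({"tool_call_emission"}))   # open-claw: 84.2
```

This also makes the 5.3-point gap concrete: open-claw's single failure carries weight 3, gemma4's carries weight 2.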
- The 5.3% score gap is within noise. What is not noise: gemma4's P95 latency is 36 seconds. Under incident conditions, a model that freezes for 36s on a planning task is operationally useless.
- Multi-step planning FAIL on gemma4 is a hard blocker. Agentic loops require chained reasoning. A model that cannot plan 3 steps reliably will collapse on any real workload.
- open-claw's tool call emission failure is inconsistent across runs — likely a prompt sensitivity issue, not a capability gap. gemma4's planning failure is deterministic.
- open-claw is faster at median (3938ms vs 6047ms) despite being 4x larger. MoE routing wins.
| Model | Params (quant) | Est. GPU Draw | Context Window |
|---|---|---|---|
| gemma4 | 8B Q4_K_M | ~60-80W | 131K |
| open-claw (Qwen3-MoE) | 30.5B Q4_K_M | ~250-350W | 262K |
On solar/salt-tank backup, open-claw draws ~4x more power. That is a real cost. But a fast, wrong answer costs more than a correct answer at higher wattage. If you have the power budget: run open-claw. If you are severely power-constrained and your workload is simple Q&A (not agentic), gemma4 suffices.
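The power argument is sharper in energy-per-query terms. Using midpoint wattages from the table and the median latencies above (both are rough assumptions; real draw varies with load and quantization), open-claw costs roughly 2.8x the energy per answer, not 4x, because it finishes faster:

```python
# Back-of-envelope: energy per query = power draw * median latency.
# Wattages are assumed midpoints of the table's ranges, not measurements.
gemma4_j    = 70  * 6.047   # ~423 J per median query
open_claw_j = 300 * 3.938   # ~1181 J per median query

print(round(open_claw_j / gemma4_j, 1))  # ~2.8x energy per query
```

And that ratio ignores retries: every failed planning attempt on gemma4 burns its energy budget for nothing.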
| Layer | Component |
|---|---|
| Model | open-claw — Qwen3-MoE 30.5B, Q4_K_M, 262K ctx, tool-use capable |
| Harness | rust/ — claw-code Rust rewrite, 20K LoC, 6 crates, binary: claw.exe |
| Bridge | ollama_proxy.py — translates Anthropic /v1/messages to Ollama OpenAI format |
| Eval | compare.py — 10-test weighted suite, reproducible |
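The core of the bridge layer is a message-format translation. A hypothetical sketch of what `ollama_proxy.py` does for the request body (the actual script also handles tool calls and SSE streaming, and may differ in detail): map an Anthropic `/v1/messages` payload onto the OpenAI-style chat body Ollama accepts.

```python
# Hypothetical translation sketch -- field names follow the public
# Anthropic Messages API and OpenAI chat format; not the actual proxy code.
def anthropic_to_openai(body: dict) -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI format expects it as the first message.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for msg in body["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "stream": body.get("stream", False),
    }
```

The response has to be translated back the same way, which is why the harness can speak pure Anthropic protocol and never know Ollama is on the other end.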
claw.exe --> ANTHROPIC_BASE_URL=http://localhost:8787
|
ollama_proxy.py
(Anthropic <-> OpenAI format translation)
|
Ollama --> open-claw (Qwen3-MoE 30.5B)
running on YOUR machine. No cloud. No data center.
# 1. Build the harness
cd rust && cargo build --release && cd ..
# 2. Start the proxy (separate terminal)
python ollama_proxy.py --port 8787
# 3. Run the comparison
python compare.py --proxy http://localhost:8787 --save my_results.md
# 4. Interactive session with open-claw via claw harness
set ANTHROPIC_BASE_URL=http://localhost:8787
set ANTHROPIC_API_KEY=ollama
rust\target\release\claw --model open-claw

Requirements: Rust 1.70+, Python 3.11+, Ollama with open-claw and gemma4 pulled.
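To sanity-check the proxy before running the full suite, you can hand-build a minimal Anthropic-style request. A sketch (the payload fields follow the public Anthropic Messages API; sending it assumes the proxy is already listening on :8787):

```python
import json

# Minimal Anthropic-format payload the proxy should accept.
payload = {
    "model": "open-claw",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Reply with OK."}],
}
body = json.dumps(payload)

# With the proxy running, POST it, e.g.:
#   curl -s http://localhost:8787/v1/messages \
#        -H "content-type: application/json" -H "x-api-key: ollama" \
#        -d @- <<< "$body"
print(body)
```

A well-formed JSON reply with a `content` block means the bridge and Ollama are both up; only then is a `compare.py` run meaningful.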
rust/ claw-code Rust harness (build here)
crates/
api/ Anthropic API client + SSE streaming
runtime/ ConversationRuntime, config, permissions, MCP, session
tools/ Bash, ReadFile, WriteFile, EditFile, Grep, Glob, Agent...
rusty-claude-cli/ REPL + one-shot prompt binary
ollama_proxy.py Anthropic <-> Ollama format bridge
compare.py Head-to-head evaluation suite
comparison_report.md Raw results from April 2026 run
src/ Python porting workspace (reference only)
tests/ Python verification surface
This repository is not affiliated with or endorsed by Anthropic.