Goal: Find the exact task-complexity ceiling where a local Qwen3-MoE 30.5B running in Ollama, driven by the claw-code Rust harness, sustains autonomous multi-step tool use — and where it collapses.
This is not a vibe check. It is a reproducible, publishable boundary test: local sovereign compute vs. cloud API agents, in 2026, on real agentic workloads.
Motivation: If a cloud AI blackout occurs — data center loss, geopolitical disruption, infrastructure attack — you need a local fallback that actually works. This study tells you which model to trust and why.
Verdict: use open-claw (Qwen3-MoE 30.5B) as your primary local fallback.
| Test | Weight | gemma4 (8B) | open-claw (30.5B MoE) |
|---|---|---|---|
| Basic response | 1 | PASS 2516ms | PASS 2359ms |
| Tool call emission | 3 | PASS 7922ms | FAIL 6735ms |
| Strict JSON output | 2 | PASS 5047ms | PASS 2890ms |
| Multi-step planning | 2 | FAIL 36203ms | PASS 6860ms |
| Tool input valid JSON | 3 | PASS 6656ms | PASS 6718ms |
| Context retention | 2 | PASS 3641ms | PASS 3204ms |
| Graceful ambiguity handling | 1 | PASS 23406ms | PASS 4671ms |
| Code generation | 2 | PASS 5437ms | PASS 2641ms |
| Multiple tool calls | 2 | PASS 10391ms | PASS 7906ms |
| Latency burst (3 calls) | 1 | PASS 2417ms | PASS 2260ms |
| Weighted score | | 89.5% | 84.2% |
| Median latency | | 6047ms | 3938ms |
| P95 latency | | 36203ms | 7906ms |
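The weighted scores above follow directly from the per-test weights: each model fails exactly one test, and the score is the weight of the passing tests over the total weight (19). A minimal sketch of that arithmetic (assumed scoring logic, not the actual `compare.py` implementation):

```python
# Weights from the results table; total weight = 19.
WEIGHTS = {
    "basic_response": 1,
    "tool_call_emission": 3,
    "strict_json": 2,
    "multi_step_planning": 2,
    "tool_input_json": 3,
    "context_retention": 2,
    "ambiguity_handling": 1,
    "code_generation": 2,
    "multiple_tool_calls": 2,
    "latency_burst": 1,
}

def weighted_score(failed: set) -> float:
    """Percentage of total weight earned by passing tests."""
    total = sum(WEIGHTS.values())
    passed = sum(w for t, w in WEIGHTS.items() if t not in failed)
    return round(100 * passed / total, 1)

print(weighted_score({"multi_step_planning"}))  # gemma4:    89.5
print(weighted_score({"tool_call_emission"}))   # open-claw: 84.2
```

This also makes the 5.3-point gap concrete: open-claw's single failure carries weight 3, gemma4's carries weight 2.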
- The 5.3% score gap is within noise. What is not noise: gemma4's P95 latency is 36 seconds. Under incident conditions, a model that freezes for 36s on a planning task is operationally useless.
- Multi-step planning FAIL on gemma4 is a hard blocker. Agentic loops require chained reasoning. A model that cannot plan 3 steps reliably will collapse on any real workload.
- open-claw's tool call emission failure is inconsistent across runs — likely a prompt sensitivity issue, not a capability gap. gemma4's planning failure is deterministic.
- open-claw is faster at median (3938ms vs 6047ms) despite being 4x larger. MoE routing wins.
| Model | Params (quant) | Est. GPU Draw | Context Window |
|---|---|---|---|
| gemma4 | 8B Q4_K_M | ~60-80W | 131K |
| open-claw (Qwen3-MoE) | 30.5B Q4_K_M | ~250-350W | 262K |
On solar/salt-tank backup, open-claw draws ~4x more power. That is a real cost. But a fast, wrong answer costs more than a correct answer at higher wattage. If you have the power budget: run open-claw. If you are severely power-constrained and your workload is simple Q&A (not agentic), gemma4 suffices.
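The power argument is sharper in energy-per-query terms. Using midpoint wattages from the table and the median latencies above (both are rough assumptions; real draw varies with load and quantization), open-claw costs roughly 2.8x the energy per answer, not 4x, because it finishes faster:

```python
# Back-of-envelope: energy per query = power draw * median latency.
# Wattages are assumed midpoints of the table's ranges, not measurements.
gemma4_j    = 70  * 6.047   # ~423 J per median query
open_claw_j = 300 * 3.938   # ~1181 J per median query

print(round(open_claw_j / gemma4_j, 1))  # ~2.8x energy per query
```

And that ratio ignores retries: every failed planning attempt on gemma4 burns its energy budget for nothing.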
| Layer | Component |
|---|---|
| Model | open-claw — Qwen3-MoE 30.5B, Q4_K_M, 262K ctx, tool-use capable |
| Harness | rust/ — claw-code Rust rewrite, 20K LoC, 6 crates, binary: claw.exe |
| Bridge | ollama_proxy.py — translates Anthropic /v1/messages to Ollama OpenAI format |
| Eval | compare.py — 10-test weighted suite, reproducible |
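The core of the bridge layer is a message-format translation. A hypothetical sketch of what `ollama_proxy.py` does for the request body (the actual script also handles tool calls and SSE streaming, and may differ in detail): map an Anthropic `/v1/messages` payload onto the OpenAI-style chat body Ollama accepts.

```python
# Hypothetical translation sketch -- field names follow the public
# Anthropic Messages API and OpenAI chat format; not the actual proxy code.
def anthropic_to_openai(body: dict) -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI format expects it as the first message.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for msg in body["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
        "stream": body.get("stream", False),
    }
```

The response has to be translated back the same way, which is why the harness can speak pure Anthropic protocol and never know Ollama is on the other end.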
claw.exe --> ANTHROPIC_BASE_URL=http://localhost:8787
|
ollama_proxy.py
(Anthropic <-> OpenAI format translation)
|
Ollama --> open-claw (Qwen3-MoE 30.5B)
running on YOUR machine. No cloud. No data center.
# 1. Build the harness
cd rust && cargo build --release && cd ..
# 2. Start the proxy (separate terminal)
python ollama_proxy.py --port 8787
# 3. Run the comparison
python compare.py --proxy http://localhost:8787 --save my_results.md
# 4. Interactive session with open-claw via claw harness
set ANTHROPIC_BASE_URL=http://localhost:8787
set ANTHROPIC_API_KEY=ollama
rust\target\release\claw --model open-claw

Requirements: Rust 1.70+, Python 3.11+, Ollama with open-claw and gemma4 pulled.
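To sanity-check the proxy before running the full suite, you can hand-build a minimal Anthropic-style request. A sketch (the payload fields follow the public Anthropic Messages API; sending it assumes the proxy is already listening on :8787):

```python
import json

# Minimal Anthropic-format payload the proxy should accept.
payload = {
    "model": "open-claw",
    "max_tokens": 64,
    "messages": [{"role": "user", "content": "Reply with OK."}],
}
body = json.dumps(payload)

# With the proxy running, POST it, e.g.:
#   curl -s http://localhost:8787/v1/messages \
#        -H "content-type: application/json" -H "x-api-key: ollama" \
#        -d @- <<< "$body"
print(body)
```

A well-formed JSON reply with a `content` block means the bridge and Ollama are both up; only then is a `compare.py` run meaningful.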
rust/ claw-code Rust harness (build here)
crates/
api/ Anthropic API client + SSE streaming
runtime/ ConversationRuntime, config, permissions, MCP, session
tools/ Bash, ReadFile, WriteFile, EditFile, Grep, Glob, Agent...
rusty-claude-cli/ REPL + one-shot prompt binary
ollama_proxy.py Anthropic <-> Ollama format bridge
compare.py Head-to-head evaluation suite
comparison_report.md Raw results from April 2026 run
src/ Python porting workspace (reference only)
tests/ Python verification surface
This repository is not affiliated with or endorsed by Anthropic.