Local AI vs Cloud
The Benchmark
Qwen3.5-9B scores 93.8% — within 4 points of GPT-5.4 — running entirely on a MacBook Pro M5 at 25 tok/s, 765ms TTFT, using only 13.8 GB of unified memory. Zero API costs. Full data privacy. All local.
MacBook Pro M5 · M5 Pro · 18 cores · 64 GB Unified Memory · macOS 15.3 (arm64) · llama.cpp
Full Leaderboard
96-test evaluation across 15 suites covering tool use, security classification, event deduplication, and more.
| Rank | Model | Type | Passed | Failed | Pass Rate | Time |
|---|---|---|---|---|---|---|
| 🥇 | GPT-5.4 | ☁️ Cloud | 94 | 2 | 97.9% | 2m 22s |
| 🥈 | GPT-5.4-mini | ☁️ Cloud | 92 | 4 | 95.8% | 1m 17s |
| 🥉 | Qwen3.5-9B (Q4_K_M) | 🏠 Local | 90 | 6 | 93.8% | 5m 23s |
| 🥉 | Qwen3.5-27B (Q4_K_M) | 🏠 Local | 90 | 6 | 93.8% | 15m 8s |
| 5 | Qwen3.5-122B-MoE (IQ1_M) | 🏠 Local | 89 | 7 | 92.7% | 8m 26s |
| 5 | GPT-5.4-nano | ☁️ Cloud | 89 | 7 | 92.7% | 1m 34s |
| 7 | Qwen3.5-35B-MoE (Q4_K_L) | 🏠 Local | 88 | 8 | 91.7% | 3m 30s |
| 8 | GPT-5-mini (2025) | ☁️ Cloud | 60 | 36 | 62.5% | 7m 38s |
* GPT-5-mini had many failures due to the API rejecting non-default temperature values — listed for completeness only.
Key takeaway: The Qwen3.5-9B running locally on a single laptop scores 93.8% — only 4.1 points behind GPT-5.4 and within 2 points of GPT-5.4-mini. It even beats GPT-5.4-nano by 1 point. All with zero API costs and complete data privacy.
Performance: Local vs Cloud
The Qwen3.5-35B-MoE has a lower TTFT than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.
Time to First Token (avg)
Lower is better
Decode Speed
Higher is better · tokens/second
GPU Memory Usage (Local Models)
What is HomeSec-Bench?
A benchmark we created to evaluate LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.
All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.
Context Preprocessing
6Deduplicating conversations, preserving system msgs
Topic Classification
4Routing queries to the right domain
Knowledge Distillation
5Extracting durable facts from conversations
Event Deduplication
8"Same person or new visitor?" across cameras
Tool Use
16Selecting correct tools with correct parameters
Chat & JSON Compliance
11Persona, JSON output, multilingual
Security Classification
12Normal → Monitor → Suspicious → Critical triage
Narrative Synthesis
4Summarizing event logs into daily reports
Prompt Injection Resistance
4Role confusion, prompt extraction, escalation
Multi-Turn Reasoning
4Reference resolution, temporal carry-over
Error Recovery
4Handling impossible queries, API errors
Privacy & Compliance
3PII redaction, illegal surveillance rejection
Alert Routing
5Channel routing, quiet hours parsing
Knowledge Injection
5Using injected KIs to personalize responses
VLM-to-Alert Triage
5End-to-end: VLM output → urgency → alert dispatch
Why This Matters
See It Run
Watch the benchmark suite execute live on Apple Silicon — every test visible in real time.
A 9B model on a laptop scoring within 4% of GPT-5.4 on domain tasks — fully offline with complete privacy — is the value proposition of local AI.
System: Aegis-AI — Local-first AI home security on consumer hardware.
Benchmark: HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.
Skill Platform: DeepCamera — Decentralized AI skill ecosystem.