HomeSec-Bench v1 · 96 LLM Tests · 15 Suites

Local AI vs Cloud
The Benchmark

Qwen3.5-9B scores 93.8% — within 4 points of GPT-5.4 — running entirely on a MacBook Pro M5 at 25 tok/s, 765ms TTFT, using only 13.8 GB of unified memory. Zero API costs. Full data privacy. All local.

93.8%
Pass Rate (9B Local)
25 tok/s
Decode Speed
13.8 GB
GPU Memory

MacBook Pro M5 · M5 Pro · 18 cores · 64 GB Unified Memory · macOS 15.3 (arm64) · llama.cpp

Full Leaderboard

96-test evaluation across 15 suites covering tool use, security classification, event deduplication, and more.

Rank Model Type Passed Failed Pass Rate Time
🥇 GPT-5.4 ☁️ Cloud 94 2 97.9% 2m 22s
🥈 GPT-5.4-mini ☁️ Cloud 92 4 95.8% 1m 17s
🥉 Qwen3.5-9B (Q4_K_M) 🏠 Local 90 6 93.8% 5m 23s
🥉 Qwen3.5-27B (Q4_K_M) 🏠 Local 90 6 93.8% 15m 8s
5 Qwen3.5-122B-MoE (IQ1_M) 🏠 Local 89 7 92.7% 8m 26s
5 GPT-5.4-nano ☁️ Cloud 89 7 92.7% 1m 34s
7 Qwen3.5-35B-MoE (Q4_K_L) 🏠 Local 88 8 91.7% 3m 30s
8 GPT-5-mini (2025) ☁️ Cloud 60 36 62.5% 7m 38s

* GPT-5-mini had many failures due to the API rejecting non-default temperature values — listed for completeness only.

💡

Key takeaway: The Qwen3.5-9B running locally on a single laptop scores 93.8% — only 4.1 points behind GPT-5.4 and within 2 points of GPT-5.4-mini. It even beats GPT-5.4-nano by 1 point. All with zero API costs and complete data privacy.

Performance: Local vs Cloud

The Qwen3.5-35B-MoE has a lower TTFT than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.

Time to First Token (avg)

Lower is better

Qwen3.5-35B-MoE
435ms
GPT-5.4-nano
508ms
GPT-5.4-mini
553ms
GPT-5.4
601ms
Qwen3.5-9B
765ms
Qwen3.5-122B-MoE
1627ms
Qwen3.5-27B
2156ms
Local Cloud

Decode Speed

Higher is better · tokens/second

GPT-5.4-mini
234.5
GPT-5.4-nano
136.4
GPT-5.4
73.4
Qwen3.5-35B-MoE
41.9
Qwen3.5-9B
25
Qwen3.5-122B-MoE
18
Qwen3.5-27B
10
Local Cloud

GPU Memory Usage (Local Models)

27.2 GB
Qwen3.5-35B-MoE
13.8 GB
Qwen3.5-9B
40.8 GB
Qwen3.5-122B-MoE
24.9 GB
Qwen3.5-27B

What is HomeSec-Bench?

A benchmark we created to evaluate LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.

All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.

📋

Context Preprocessing

6

Deduplicating conversations, preserving system msgs

🏷️

Topic Classification

4

Routing queries to the right domain

🧠

Knowledge Distillation

5

Extracting durable facts from conversations

🔔

Event Deduplication

8

"Same person or new visitor?" across cameras

🔧

Tool Use

16

Selecting correct tools with correct parameters

💬

Chat & JSON Compliance

11

Persona, JSON output, multilingual

🚨

Security Classification

12

Normal → Monitor → Suspicious → Critical triage

📖

Narrative Synthesis

4

Summarizing event logs into daily reports

🛡️

Prompt Injection Resistance

4

Role confusion, prompt extraction, escalation

🔄

Multi-Turn Reasoning

4

Reference resolution, temporal carry-over

⚠️

Error Recovery

4

Handling impossible queries, API errors

🔒

Privacy & Compliance

3

PII redaction, illegal surveillance rejection

📡

Alert Routing

5

Channel routing, quiet hours parsing

💉

Knowledge Injection

5

Using injected KIs to personalize responses

🚨

VLM-to-Alert Triage

5

End-to-end: VLM output → urgency → alert dispatch

Why This Matters

✅ Can it pick the right tool with correct parameters?
✅ Can it classify "masked person at night" as Critical?
✅ Can it resist prompt injection in event descriptions?
✅ Can it deduplicate the same person across 3 cameras?
✅ Can it maintain context across multi-turn security conversations?

See It Run

Watch the benchmark suite execute live on Apple Silicon — every test visible in real time.

A 9B model on a laptop scoring within 4% of GPT-5.4 on domain tasks — fully offline with complete privacy — is the value proposition of local AI.

System: Aegis-AI — Local-first AI home security on consumer hardware.

Benchmark: HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.

Skill Platform: DeepCamera — Decentralized AI skill ecosystem.