HomeSec-Bench v1 · 96 LLM Tests · 15 Suites

Local AI vs Cloud
The Benchmark

Qwen3.5-9B scores 93.8% — within 4 points of GPT-5.4 — running entirely on a MacBook Pro M5 at 25 tok/s, 765ms TTFT, using only 13.8 GB of unified memory. Zero API costs. Full data privacy. All local.

93.8%

Pass Rate (9B Local)

25 tok/s

Decode Speed

13.8 GB

GPU Memory

MacBook Pro M5 · M5 Pro · 18 cores · 64 GB Unified Memory · macOS 15.3 (arm64) · llama.cpp

Full Leaderboard

96-test evaluation across 15 suites covering tool use, security classification, event deduplication, and more.

Rank	Model	Type	Passed	Failed	Pass Rate	Time
🥇	GPT-5.4	☁️ Cloud	94	2	97.9%	2m 22s
🥈	GPT-5.4-mini	☁️ Cloud	92	4	95.8%	1m 17s
🥉	Qwen3.5-9B (Q4_K_M)	🏠 Local	90	6	93.8%	5m 23s
🥉	Qwen3.5-27B (Q4_K_M)	🏠 Local	90	6	93.8%	15m 8s
5	Qwen3.5-122B-MoE (IQ1_M)	🏠 Local	89	7	92.7%	8m 26s
5	GPT-5.4-nano	☁️ Cloud	89	7	92.7%	1m 34s
7	Qwen3.5-35B-MoE (Q4_K_L)	🏠 Local	88	8	91.7%	3m 30s
8	GPT-5-mini (2025)	☁️ Cloud	60	36	62.5%	7m 38s

* GPT-5-mini had many failures due to the API rejecting non-default temperature values — listed for completeness only.

💡

Key takeaway: The Qwen3.5-9B running locally on a single laptop scores 93.8% — only 4.1 points behind GPT-5.4 and within 2 points of GPT-5.4-mini. It even beats GPT-5.4-nano by 1 point. All with zero API costs and complete data privacy.

Performance: Local vs Cloud

The Qwen3.5-35B-MoE has a lower TTFT than all OpenAI cloud models — 435ms vs 508ms for GPT-5.4-nano.

Time to First Token (avg)

Lower is better

Qwen3.5-35B-MoE

435ms

GPT-5.4-nano

508ms

GPT-5.4-mini

553ms

GPT-5.4

601ms

Qwen3.5-9B

765ms

Qwen3.5-122B-MoE

1627ms

Qwen3.5-27B

2156ms

Local Cloud

Decode Speed

Higher is better · tokens/second

GPT-5.4-mini

234.5

GPT-5.4-nano

136.4

GPT-5.4

73.4

Qwen3.5-35B-MoE

41.9

Qwen3.5-9B

25

Qwen3.5-122B-MoE

18

Qwen3.5-27B

10

Local Cloud

GPU Memory Usage (Local Models)

27.2 GB

Qwen3.5-35B-MoE

13.8 GB

Qwen3.5-9B

40.8 GB

Qwen3.5-122B-MoE

24.9 GB

Qwen3.5-27B

What is HomeSec-Bench?

A benchmark we created to evaluate LLMs on real home security assistant workflows — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.

All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.

📋

Context Preprocessing

6

Deduplicating conversations, preserving system msgs

🏷️

Topic Classification

4

Routing queries to the right domain

🧠

Knowledge Distillation

5

Extracting durable facts from conversations

🔔

Event Deduplication

8

"Same person or new visitor?" across cameras

🔧

Tool Use

16

Selecting correct tools with correct parameters

💬

Chat & JSON Compliance

11

Persona, JSON output, multilingual

🚨

Security Classification

12

Normal → Monitor → Suspicious → Critical triage

📖

Narrative Synthesis

4

Summarizing event logs into daily reports

🛡️

Prompt Injection Resistance

4

Role confusion, prompt extraction, escalation

🔄

Multi-Turn Reasoning

4

Reference resolution, temporal carry-over

⚠️

Error Recovery

4

Handling impossible queries, API errors

🔒

Privacy & Compliance

3

PII redaction, illegal surveillance rejection

📡

Alert Routing

5

Channel routing, quiet hours parsing

💉

Knowledge Injection

5

Using injected KIs to personalize responses

🚨

VLM-to-Alert Triage

5

End-to-end: VLM output → urgency → alert dispatch

Why This Matters

✅ Can it pick the right tool with correct parameters?

✅ Can it classify "masked person at night" as Critical?

✅ Can it resist prompt injection in event descriptions?

✅ Can it deduplicate the same person across 3 cameras?

✅ Can it maintain context across multi-turn security conversations?

See It Run

Watch the benchmark suite execute live on Apple Silicon — every test visible in real time.

A 9B model on a laptop scoring within 4% of GPT-5.4 on domain tasks — fully offline with complete privacy — is the value proposition of local AI.

Download Aegis Benchmark on GitHub

System: Aegis-AI — Local-first AI home security on consumer hardware.

Benchmark: HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.

Skill Platform: DeepCamera — Decentralized AI skill ecosystem.

Local AI vs Cloud The Benchmark

Full Leaderboard

Performance: Local vs Cloud

Time to First Token (avg)

Decode Speed

GPU Memory Usage (Local Models)

What is HomeSec-Bench?

Context Preprocessing

Topic Classification

Knowledge Distillation

Event Deduplication

Tool Use

Chat & JSON Compliance

Security Classification

Narrative Synthesis

Prompt Injection Resistance

Multi-Turn Reasoning

Error Recovery

Privacy & Compliance

Alert Routing

Knowledge Injection

VLM-to-Alert Triage

Why This Matters

See It Run

Local AI vs Cloud
The Benchmark