Inspiration

In urban combat, room clearance is one of the most dangerous moments a soldier faces. A closed door, zero visibility, no idea what's on the other side. Operators stack up and breach blind. We asked: what if a ground robot could enter first, see everything, and report back through voice — not a screen. Eyes on your rifle, ears on BREACHER.

What it does

BREACHER is a voice-first autonomous tactical intelligence system built for military ground operations. It drives a UGV Beast rover into hostile rooms, analyzes the scene with GPT-4o Vision at 2 FPS, and narrates findings through sub-100ms Smallest.ai text-to-speech — completely hands-free. Operators command it naturally: "Breacher, how many hostiles?" "Breacher, threat level?" "Breacher, hold position." It assigns NATO phonetic callsigns (Alpha, Bravo, Charlie) to detected subjects, tracks movement and posture changes, escalates threats through a 4-tier priority alert queue, and autonomously sweeps rooms using a clockwise wall-following algorithm. A React dashboard provides a Mission Debrief view for command staff after-action review, but the core product is the voice — no screen required downrange.
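The 4-tier alert queue described above can be sketched in a few lines. The tier names and messages here are illustrative stand-ins, not BREACHER's actual labels:

```python
import heapq
import itertools

# Hypothetical tier ordering -- lower number speaks first.
TIERS = {"CRITICAL": 0, "THREAT": 1, "CONTACT": 2, "STATUS": 3}

class AlertQueue:
    """Priority queue for spoken alerts: higher-priority tiers
    preempt lower ones; ties break by arrival order."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # stable FIFO tie-breaker

    def push(self, tier, message):
        heapq.heappush(self._heap, (TIERS[tier], next(self._counter), message))

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = AlertQueue()
q.push("STATUS", "Room sweep forty percent complete")
q.push("CRITICAL", "Weapon detected, subject Alpha, twelve o'clock")
print(q.pop())  # -> Weapon detected, subject Alpha, twelve o'clock
```

The counter matters: without it, two alerts in the same tier would compare their message strings, which would reorder same-priority alerts alphabetically instead of by arrival.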

How we built it

Four async Python tasks run concurrently: vision analysis (GPT-4o on camera frames encoded to base64), TTS output (Smallest.ai Lightning with barge-in support), STT input (Smallest.ai Pulse with wake-word detection), and autonomous navigation (the Cyberwave SDK controlling the rover's motors and ultrasonics). A WebSocket server bridges the backend to a React + Tailwind frontend styled as a Palantir-inspired military terminal. Pydantic models enforce structure on every vision response. The entire system is orchestrated through a single asyncio event loop in main.py.
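A stripped-down sketch of that layout, with the four loops reduced to stubs that share an alert queue (the real main.py wires these to GPT-4o, Smallest.ai, and the Cyberwave SDK; all names here are illustrative):

```python
import asyncio

async def vision_loop(alerts):   # ~2 FPS frame analysis -> alert queue
    await alerts.put("vision online")

async def tts_loop(alerts):      # speaks queued alerts as they arrive
    return await alerts.get()

async def stt_loop():            # wake-word listening stub
    await asyncio.sleep(0)

async def nav_loop():            # wall-following drive-command stub
    await asyncio.sleep(0)

async def main():
    alerts = asyncio.Queue()
    # One event loop, four concurrent tasks.
    results = await asyncio.gather(
        vision_loop(alerts), tts_loop(alerts), stt_loop(), nav_loop()
    )
    return results[1]  # whatever TTS pulled off the queue

print(asyncio.run(main()))  # -> vision online
```

The queue is the only coupling point: vision never waits on speech, which is what keeps the pipeline non-blocking.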

Challenges we ran into

- Barge-in coordination: when an operator interrupts mid-speech, the TTS must cancel instantly while the STT simultaneously captures the new command. Getting these two async streams to cooperate without race conditions was the hardest technical problem.
- Vision latency vs. accuracy: GPT-4o gives incredible scene understanding, but at roughly 500 ms per frame. We had to architect the entire alert pipeline to be non-blocking so stale frames never delay critical voice output.
- Dead reckoning drift: without SLAM or LIDAR, the rover's position estimate, derived only from motor commands and timing, drifts over time. We tuned the wall-following algorithm to be robust to this uncertainty.
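The barge-in pattern boils down to cancelling the in-flight TTS task the moment speech is detected, then handing the audio channel to STT. A minimal asyncio sketch, with TTS faked as a word-per-tick stream (timings and names are illustrative):

```python
import asyncio

async def speak(text, spoken):
    """Pretend TTS stream: one word per tick until done or cancelled."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)

async def barge_in_demo():
    spoken = []
    tts = asyncio.create_task(
        speak("three hostiles detected northwest corner", spoken)
    )
    await asyncio.sleep(0.015)   # operator starts talking mid-sentence
    tts.cancel()                 # kill playback immediately...
    try:
        await tts
    except asyncio.CancelledError:
        pass
    # ...then the mic belongs to STT for the new command
    return spoken

partial = asyncio.run(barge_in_demo())
print(partial)  # only the words spoken before the interrupt
```

`Task.cancel()` raises `CancelledError` inside the TTS coroutine at its next await point, which is what makes the cutoff feel instant instead of waiting for the sentence to finish.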

Accomplishments that we're proud of

- True voice-first UX: the entire system works with zero screen interaction, exactly how it needs to work in a combat environment
- Sub-100ms TTS latency from scene detection to spoken alert
- A 4-tier priority queue that correctly interrupts a status update to announce a weapon detection
- NATO phonetic callsign tracking that persists across frames ("Alpha moved from northwest corner to doorway")
- Military terminology mode that speaks the way operators actually communicate, no translation needed
- The whole thing built and integrated in under 6 hours
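The persistent callsign tracking can be illustrated with a toy tracker. Subject IDs and the matching logic here are simplified stand-ins for the real association done from GPT-4o scene descriptions:

```python
NATO = ["Alpha", "Bravo", "Charlie", "Delta", "Echo"]

class CallsignTracker:
    """Assigns NATO phonetic callsigns in order of first sighting and
    narrates position changes across frames (a simplified sketch)."""
    def __init__(self):
        self.callsigns = {}   # subject_id -> callsign
        self.positions = {}   # subject_id -> last known position

    def update(self, subject_id, position):
        if subject_id not in self.callsigns:
            self.callsigns[subject_id] = NATO[len(self.callsigns)]
        sign = self.callsigns[subject_id]
        prev = self.positions.get(subject_id)
        self.positions[subject_id] = position
        if prev and prev != position:
            return f"{sign} moved from {prev} to {position}"
        return f"{sign} holding at {position}"

t = CallsignTracker()
t.update("subj-1", "northwest corner")
print(t.update("subj-1", "doorway"))
# -> Alpha moved from northwest corner to doorway
```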

What we learned

Voice is a fundamentally different interface from a visual one. When you design for operators who can't look at a screen because they're holding a weapon in a hallway, every design decision changes. You can't show a table of five threats; you have to prioritize one. You can't display a map; you have to describe positions spatially, using clock directions and room geometry. Constraint breeds clarity: BREACHER's voice output is more tactically useful than most C2 dashboards because it has to be.
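Clock-direction reporting is a small geometric transform. A sketch, assuming subject positions arrive as offsets (meters ahead, meters right) in the rover's own frame; the coordinate convention is our assumption, not a spec:

```python
import math

def clock_direction(ahead, right):
    """Convert an (ahead, right) offset from the rover's heading
    into an operator-friendly clock direction."""
    angle = math.degrees(math.atan2(right, ahead)) % 360  # 0 = dead ahead
    hour = round(angle / 30) % 12                         # 30 degrees per hour
    return f"{hour if hour else 12} o'clock"

print(clock_direction(5, 0))   # dead ahead     -> 12 o'clock
print(clock_direction(0, 5))   # directly right -> 3 o'clock
print(clock_direction(0, -5))  # directly left  -> 9 o'clock
```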

What's next for Breacher

- SLAM integration for accurate indoor mapping instead of dead reckoning
- Multi-rover coordination with shared scene models for compound clearing
- On-device vision (a fine-tuned model on rover compute) to eliminate cloud dependency in DDIL environments
- Thermal/IR camera fusion for night operations and smoke-filled rooms
- Integration with existing military C2 systems and battle management networks
- Field evaluation with infantry and SOF units
