Built at Robotic Agents Hackathon | March 13, 2026 | Frontier Tower, SF
VanguardAI transforms the Unitree Go2 quadruped into an autonomous security patrol system controlled by voice commands. Speak naturally — the robot responds in real-time.
End-to-end pipeline: voice → STT → LLM → structured command → robot movement. Full loop in roughly 500ms.
"Go forward" → Robot walks forward
"Turn left" → Robot turns 90° left
"Patrol" → Robot starts patrol pattern
"Stop" → Robot halts immediately
The LLM handles natural language variation — "walk ahead", "move forward", "go straight" all resolve to the same action. Structured JSON output keeps robot command execution deterministic.
Voice Input
↓
Smallest.ai Pulse STT (WebSocket stream, 64ms latency)
↓
Together AI — Llama-3.3-70B
↓ {"action": "move_forward", "value": 1.0}
Cyberwave SDK
↓
Unitree Go2 (4D LIDAR + RGB camera)
| Stage | Latency |
|---|---|
| STT (Smallest.ai Pulse) | ~64ms |
| LLM inference (Together AI) | ~200–400ms |
| Robot command execution | ~50ms |
| End-to-end voice → movement | ~350–550ms |
LLM latency is the dominant term. This is a real-time constraint problem — at 500ms end-to-end, the system is usable for security patrol. Sub-200ms would require either a smaller model or self-hosted inference with speculative decoding. The Together AI serving stack handles the inference infrastructure; understanding what's underneath it (continuous batching, KV cache management, tensor parallelism) informs how to optimize this pipeline further.
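The budget arithmetic behind that claim, using only the approximate figures from the table above (no new measurements):

```python
# Per-stage latencies as quoted in the table above (approximate).
stages = {
    "stt_ms": 64,          # Smallest.ai Pulse
    "llm_ms": (200, 400),  # Together AI, Llama-3.3-70B (range)
    "robot_ms": 50,        # command execution
}

low = stages["stt_ms"] + stages["llm_ms"][0] + stages["robot_ms"]
high = stages["stt_ms"] + stages["llm_ms"][1] + stages["robot_ms"]
print(f"end-to-end: ~{low}-{high} ms")  # ~314-514 ms

# The LLM dominates the worst case:
llm_share = stages["llm_ms"][1] / high
print(f"LLM share of worst case: {llm_share:.0%}")  # 78%
```

At roughly three-quarters of the worst-case budget, the LLM stage is the only lever that meaningfully moves the end-to-end number.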
| Component | Technology | Detail |
|---|---|---|
| Speech-to-Text | Smallest.ai Pulse | WebSocket stream, 64ms |
| LLM | Together AI — Llama-3.3-70B | Natural language → JSON |
| Robot Control | Cyberwave SDK | Digital twin + locomotion |
| Hardware | Unitree Go2 | Quadruped, 4D LIDAR, RGB camera |
VanguardAI/
├── src/
│ ├── voice.py # Smallest.ai STT — audio → text
│ ├── brain.py # Together AI LLM — text → action JSON
│ ├── robot.py # Cyberwave SDK — action → robot movement
│ └── vision.py # Together AI VLM — camera → threat detection
├── main.py # Orchestration loop
├── requirements.txt
└── .env.example
git clone https://github.com/IneshReddy249/VanguardAI.git
cd VanguardAI
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# SMALLEST_API_KEY=your_key
# TOGETHER_API_KEY=your_key
# CYBERWAVE_API_KEY=your_key
python main.py
# Speak when you see: Listening...
| Command | Action |
|---|---|
| "go forward" / "move ahead" / "walk straight" | Forward |
| "go back" / "move backward" | Backward |
| "turn left" | Rotate left 90° |
| "turn right" | Rotate right 90° |
| "stop" / "halt" | Stop |
| "patrol" | Start patrol sequence |
| "quit" / "exit" | End program |
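The table above could map to robot calls with a simple dispatch, sketched here against a stub. The `move` / `rotate` / `stop` / `patrol` method names are stand-ins, not the real Cyberwave SDK API:

```python
# Stub robot that records calls instead of moving hardware.
class StubRobot:
    def __init__(self):
        self.log = []
    def move(self, meters): self.log.append(("move", meters))
    def rotate(self, deg):  self.log.append(("rotate", deg))
    def stop(self):         self.log.append(("stop",))
    def patrol(self):       self.log.append(("patrol",))

def execute(robot, cmd: dict) -> None:
    """Map a validated action JSON to a motion call."""
    value = cmd.get("value", 1.0)
    dispatch = {
        "move_forward":  lambda: robot.move(value),
        "move_backward": lambda: robot.move(-value),
        "turn_left":     lambda: robot.rotate(90),
        "turn_right":    lambda: robot.rotate(-90),
        "stop":          robot.stop,
        "patrol":        robot.patrol,
    }
    dispatch[cmd["action"]]()

r = StubRobot()
execute(r, {"action": "move_forward", "value": 1.0})
execute(r, {"action": "turn_left"})
print(r.log)  # [('move', 1.0), ('rotate', 90)]
```

Keeping the dispatch table closed (a fixed dict, no dynamic lookup) is what makes execution deterministic regardless of what the LLM produces.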
voice.py — Records 4s audio, streams to Smallest.ai Pulse via WebSocket, returns transcribed text.
brain.py — Sends transcript to Llama-3.3-70B with structured output prompt. Returns deterministic JSON:
{"action": "move_forward", "value": 1.0}
robot.py — Maps action JSON to Cyberwave SDK motion bindings. Triggers robot locomotion.
vision.py — Streams camera feed to Together AI VLM for real-time threat detection and scene understanding.
main.py — Infinite loop: listen → parse → execute.
Graceful shutdown on quit or Ctrl+C.
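The loop itself can be sketched in a few lines. `listen`, `parse`, and `execute` below are placeholders standing in for voice.py, brain.py, and robot.py:

```python
# Minimal sketch of the main.py orchestration loop:
# listen → parse → execute, with graceful shutdown.
def run(listen, parse, execute):
    try:
        while True:
            text = listen()
            cmd = parse(text)
            if cmd is None:       # unrecognized command: ignore, keep listening
                continue
            if cmd["action"] in ("quit", "exit"):
                break             # graceful shutdown on quit
            execute(cmd)
    except KeyboardInterrupt:
        pass                      # graceful shutdown on Ctrl+C

# Demo with canned inputs in place of the microphone and LLM:
inputs = iter(["go forward", "quit"])
seen = []
run(lambda: next(inputs),
    lambda t: {"action": "move_forward"} if "forward" in t else {"action": "quit"},
    seen.append)
print(seen)  # [{'action': 'move_forward'}]
```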
The 200–400ms LLM step is the system bottleneck. Optimizing it requires:
- Smaller model — a 7B instruction-tuned model can likely handle structured command parsing in <50ms with acceptable quality
- Speculative decoding — draft model generates the short JSON output, target verifies; output is typically <20 tokens, ideal for spec decoding
- Self-hosted inference — removing Together AI API overhead and running TensorRT-LLM locally on edge hardware would cut latency by ~150ms
This is the next engineering step for a production deployment.
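Why speculative decoding suits this workload: a back-of-envelope estimate using the standard expected-tokens-per-step formula. The draft length `k` and acceptance rate `a` below are assumed values for illustration, not measurements from this system:

```python
# Expected tokens accepted per target-model verification step when a
# draft model proposes k tokens with per-token acceptance rate a
# (geometric-series formula from the speculative decoding literature).
k, a = 4, 0.7  # assumed, not measured
expected = (1 - a ** (k + 1)) / (1 - a)
print(f"~{expected:.2f} tokens per target step")  # ~2.77

# A <20-token JSON command then needs ~ceil(20 / expected) target steps
# instead of 20 — i.e. nearly a 2.8x cut in target-model forward passes.
```

Short, highly predictable outputs like `{"action": "move_forward", "value": 1.0}` push the acceptance rate up, which is why the JSON-only output format is such a good fit for this technique.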
The inference stack that powers the LLM layer in this system:
- Llama-3.1-8B on H100 — 1,700+ tok/s, 11ms TTFT
- Speculative Decoding — 2.26× latency reduction on Qwen 2.5
- Qwen2.5-32B on H200 — 3,981 tok/s at 64 concurrent
Inesh Reddy Chappidi — LLM Inference & Systems Engineer