Inspiration
In an earlier role I built robotic camera systems, including computer vision pipelines for sports broadcasting. Recent advances in computer vision, especially the supervision library, inspired this work.
What it does
HerdFlow is a voice-first AI assistant for livestock monitoring. It watches a camera feed, detects and tracks individual animals in real-time, and lets farmers have natural voice conversations about their herd. Ask "how's cow 3 doing?" and HerdFlow answers from live scene data — no typing, no screens to tap.
- Real-time animal detection & tracking — RF-DETR + ByteTrack identifies and follows individual animals across frames
- Natural voice interaction — Gemini 2.5 Flash Native Audio enables bidirectional speech with a veterinary persona
- Visual scene understanding — Gemini 3 Flash analyzes frames every 30 seconds, annotating each animal with posture, color, and health notes
- Proactive health alerts — detects prolonged lying, isolation from herd, and missed feeding automatically
- Tool-augmented reasoning — 6 tools for querying animal history, herd stats, zone occupancy, and visual analysis on demand
How we built it
HerdFlow uses a two-pipe architecture — voice and video run as independent processes connected via LiveKit data channels:
- Voice Process: Google ADK Agent with Gemini 2.5 Flash Native Audio (Live API) for real-time bidirectional speech, Google Cloud STT for async transcription, conversation memory with automatic session rotation every 8 minutes
- Video Process: RF-DETR object detection → ByteTrack multi-object tracking → Scene Graph → Gemini 3 Flash for visual summaries and per-animal annotations
- AnalystBridge: Cross-pipe data relay so the voice agent can describe what the camera sees without directly processing video
- React Frontend: LiveKit WebRTC with annotated video overlay, voice UI, alert cards, and herd dashboard
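The AnalystBridge idea — the voice agent answers from the latest scene snapshot rather than processing video itself — can be sketched like this. In HerdFlow the transport is a LiveKit data channel between two processes; here a lock-guarded slot stands in for it, and all names are illustrative.

```python
import json
import threading

class AnalystBridge:
    """Sketch of the cross-pipe relay: the video process publishes
    scene-graph snapshots; the voice agent reads the latest one.
    Hypothetical names -- not HerdFlow's actual implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest: dict | None = None

    def publish_scene(self, scene: dict) -> None:
        # Video pipe: store the newest snapshot; older ones are stale.
        with self._lock:
            self._latest = json.loads(json.dumps(scene))  # defensive copy

    def describe_animal(self, track_id: int) -> str:
        # Voice pipe: answer "how's cow 3 doing?" from the snapshot.
        with self._lock:
            scene = self._latest
        if not scene:
            return "No scene data yet."
        for a in scene.get("animals", []):
            if a["track_id"] == track_id:
                return f"Cow {track_id} is {a['posture']}; notes: {a.get('notes', 'none')}"
        return f"Cow {track_id} is not in view."
```

Keeping only the latest snapshot (rather than a queue) matches the use case: the farmer always wants the current state of the herd, not a replay.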
Challenges we ran into
- Time crunch — we learned about the hackathon late and only started on Sunday afternoon.
- Session limits — Gemini Live's 10-minute session cap caused 1011 Deadline Expired errors. Solved with proactive session rotation at 8 minutes plus conversation-memory carry-over.
- Video-in-audio pipeline instability — sending video frames through Gemini Live API caused 1007 errors. Solved by splitting into two independent processes (two-pipe architecture).
- Voice latency — thinking tokens from Gemini added delay. Disabled thinking budget (thinking_budget=0) and removed scene injection loops to get sub-second first-audio-frame latency.
- Context stuffing — long sessions caused Gemini to lose context. Built ConversationMemory that carries observations and farmer questions across session rotations.
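The rotation-plus-memory pattern from the two bullets above can be sketched as follows. This is a simplified stand-in, not HerdFlow's actual code: `open_session` represents the real ADK/Live API connection call, and the `ConversationMemory` fields are illustrative.

```python
import asyncio
import time

class ConversationMemory:
    """Carries observations and farmer questions across rotations.
    Field names are illustrative, not HerdFlow's actual schema."""
    def __init__(self):
        self.observations: list[str] = []
        self.questions: list[str] = []

    def as_system_context(self) -> str:
        # Re-seed a fresh session with the most recent context.
        return "Prior observations: " + "; ".join(self.observations[-10:])

async def run_with_rotation(open_session, memory, budget_s=8 * 60,
                            max_rotations=None):
    """Rotate the live session proactively before the 10-minute hard
    limit, instead of waiting for a 1011 Deadline Expired error.
    `open_session(context=...)` is a hypothetical stand-in."""
    rotations = 0
    while max_rotations is None or rotations < max_rotations:
        session = await open_session(context=memory.as_system_context())
        deadline = time.monotonic() + budget_s
        try:
            while time.monotonic() < deadline:
                await session.step()   # pump audio in / audio out
        finally:
            await session.close()      # then the loop reopens with memory
        rotations += 1
```

Rotating at 8 minutes leaves headroom under the 10-minute cap, so the farmer never hears a mid-sentence disconnect.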
Accomplishments that we're proud of
- Two-pipe architecture that cleanly separates concerns — voice and video can scale independently
- Sub-second voice response latency with full tool-calling capability
- Natural veterinary persona that references specific animals by track ID
- The feeling of intelligence it evokes
What we learned
- Gemini Live API is powerful but requires careful session lifecycle management
- ADK's LiveRequestQueue + Runner pattern makes real-time audio streaming clean
- Separating voice and video into independent processes (connected via data channels) is more robust than a monolithic agent
- ByteTrack + RF-DETR gives surprisingly good individual animal tracking even on low-resolution webcam feeds
What's next for HerdFlow
HerdFlow is an instance of the VisionFlow pattern — a domain-agnostic architecture for real-time visual monitoring with AI reasoning. The same three-tier design (Perception → Reasoning → Communication) can adapt to construction safety, warehouse ops, wildlife conservation, or manufacturing QA. Domain-specific code is isolated in configuration, prompts, and alert rules.