Inspiration
In an earlier role I built robotic camera systems, including computer vision pipelines for sports broadcasting. Recent advances in computer vision, especially the supervision library, inspired this work.
What it does
HerdFlow is a voice-first AI assistant for livestock monitoring. It watches a camera feed, detects and tracks individual animals in real-time, and lets farmers have natural voice conversations about their herd. Ask "how's cow 3 doing?" and HerdFlow answers from live scene data — no typing, no screens to tap.
- Real-time animal detection & tracking — RF-DETR + ByteTrack identifies and follows individual animals across frames
- Natural voice interaction — Gemini 2.5 Flash Native Audio enables bidirectional speech with a veterinary persona
- Visual scene understanding — Gemini 3 Flash analyzes frames every 30 seconds, annotating each animal with posture, color, and health notes
- Proactive health alerts — detects prolonged lying, isolation from herd, and missed feeding automatically
- Tool-augmented reasoning — 6 tools for querying animal history, herd stats, zone occupancy, and visual analysis on demand
How we built it
HerdFlow uses a two-pipe architecture — voice and video run as independent processes connected via LiveKit data channels:
- Voice Process: Google ADK Agent with Gemini 2.5 Flash Native Audio (Live API) for real-time bidirectional speech, Google Cloud STT for async transcription, conversation memory with automatic session rotation every 8 minutes
- Video Process: RF-DETR object detection → ByteTrack multi-object tracking → Scene Graph → Gemini 3 Flash for visual summaries and per-animal annotations
- AnalystBridge: Cross-pipe data relay so the voice agent can describe what the camera sees without directly processing video
- React Frontend: LiveKit WebRTC with annotated video overlay, voice UI, alert cards, and herd dashboard
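The AnalystBridge idea — the voice agent answers from the latest scene snapshot rather than processing video itself — can be sketched like this. In HerdFlow the transport is a LiveKit data channel between two processes; here a lock-guarded slot stands in for it, and all names are illustrative.

```python
import json
import threading

class AnalystBridge:
    """Sketch of the cross-pipe relay: the video process publishes
    scene-graph snapshots; the voice agent reads the latest one.
    Hypothetical names -- not HerdFlow's actual implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest: dict | None = None

    def publish_scene(self, scene: dict) -> None:
        # Video pipe: store the newest snapshot; older ones are stale.
        with self._lock:
            self._latest = json.loads(json.dumps(scene))  # defensive copy

    def describe_animal(self, track_id: int) -> str:
        # Voice pipe: answer "how's cow 3 doing?" from the snapshot.
        with self._lock:
            scene = self._latest
        if not scene:
            return "No scene data yet."
        for a in scene.get("animals", []):
            if a["track_id"] == track_id:
                return f"Cow {track_id} is {a['posture']}; notes: {a.get('notes', 'none')}"
        return f"Cow {track_id} is not in view."
```

Keeping only the latest snapshot (rather than a queue) matches the use case: the farmer always wants the current state of the herd, not a replay.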
Challenges we ran into
- Time crunch — we learned about the hackathon late and only started on Sunday afternoon.
- Session limits — Gemini Live's 10-minute session cap caused 1011 Deadline Expired errors. Solved with proactive session rotation at 8 minutes plus conversation-memory carry-over.
- Video-in-audio pipeline instability — sending video frames through Gemini Live API caused 1007 errors. Solved by splitting into two independent processes (two-pipe architecture).
- Voice latency — thinking tokens from Gemini added delay. Disabled thinking budget (thinking_budget=0) and removed scene injection loops to get sub-second first-audio-frame latency.
- Context stuffing — long sessions caused Gemini to lose context. Built ConversationMemory that carries observations and farmer questions across session rotations.
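The rotation-plus-memory pattern from the two bullets above can be sketched as follows. This is a simplified stand-in, not HerdFlow's actual code: `open_session` represents the real ADK/Live API connection call, and the `ConversationMemory` fields are illustrative.

```python
import asyncio
import time

class ConversationMemory:
    """Carries observations and farmer questions across rotations.
    Field names are illustrative, not HerdFlow's actual schema."""
    def __init__(self):
        self.observations: list[str] = []
        self.questions: list[str] = []

    def as_system_context(self) -> str:
        # Re-seed a fresh session with the most recent context.
        return "Prior observations: " + "; ".join(self.observations[-10:])

async def run_with_rotation(open_session, memory, budget_s=8 * 60,
                            max_rotations=None):
    """Rotate the live session proactively before the 10-minute hard
    limit, instead of waiting for a 1011 Deadline Expired error.
    `open_session(context=...)` is a hypothetical stand-in."""
    rotations = 0
    while max_rotations is None or rotations < max_rotations:
        session = await open_session(context=memory.as_system_context())
        deadline = time.monotonic() + budget_s
        try:
            while time.monotonic() < deadline:
                await session.step()   # pump audio in / audio out
        finally:
            await session.close()      # then the loop reopens with memory
        rotations += 1
```

Rotating at 8 minutes leaves headroom under the 10-minute cap, so the farmer never hears a mid-sentence disconnect.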
Accomplishments that we're proud of
- Two-pipe architecture that cleanly separates concerns — voice and video can scale independently
- Sub-second voice response latency with full tool-calling capability
- Natural veterinary persona that references specific animals by track ID
- The feeling of intelligence it evokes
What we learned
- Gemini Live API is powerful but requires careful session lifecycle management
- ADK's LiveRequestQueue + Runner pattern makes real-time audio streaming clean
- Separating voice and video into independent processes (connected via data channels) is more robust than a monolithic agent
- ByteTrack + RF-DETR gives surprisingly good individual animal tracking even on low-resolution webcam feeds
What's next for HerdFlow
HerdFlow is an instance of the VisionFlow pattern — a domain-agnostic architecture for real-time visual monitoring with AI reasoning. The same three-tier design (Perception → Reasoning → Communication) can adapt to construction safety, warehouse ops, wildlife conservation, or manufacturing QA. Domain-specific code is isolated in configuration, prompts, and alert rules.