Inspiration
Soccer coaches and analysts spend hours manually reviewing footage — pausing, rewinding, sketching formations on whiteboards, and trying to communicate complex tactical ideas to their players. The tools available are either expensive enterprise platforms or basic video players with no intelligence built in. There's no way to simply talk to your match footage and get real-time analysis back. We wanted to change that — to make analyzing a soccer match feel less like scrubbing through video and more like having a conversation with a world-class tactical analyst. One that watches the match with you, listens to your questions, draws play corrections on screen, and speaks back with grounded insights — all in real-time, all by voice, without ever touching your mouse.
Phantom Coach is built to solve exactly this.
What it does
Phantom Coach is a multimodal AI coaching assistant powered by the Gemini 2.5 Flash Live API. Upload any soccer match and have a natural, interruptible voice conversation with an AI that operates at the level of a UEFA Pro License tactical analyst.
Ask it what you see. Interrupt it mid-sentence. Tell it to switch to the 2D tactical board. Watch as it draws correction arrows, simulates player movements, and highlights passing lanes — all through voice commands, without ever touching your mouse.
Core capabilities:
🎙️ Live Voice Analysis — Bidirectional streaming with Gemini 2.5 Flash. Gemini natively watches the raw video frames in real-time alongside you, identifies formations, detects pressing triggers, and narrates tactical insights while you interrupt and redirect naturally.
📐 Auto-Generated 2D Tactical Board — Real-time player tracking powered by YOLOv8 + ByteTrack with appearance Re-ID. Positions are calibrated to a normalized pitch via RANSAC homography with Kalman smoothing. The tactical board updates live as the match progresses.
🎯 Voice-Commanded Tactical Simulations — Say "Show me where the left winger should be" and the AI animates player movements, draws correction arrows, and highlights runs directly on the 2D board — no mouse, no keyboard.
⚡ Real-Time Tactical Alerts — A computer vision pipeline detects turnovers, pressing triggers, line-breaking passes, and formation shifts in real-time, feeding structured grounding data into Gemini alongside the video feed to prevent hallucination.
⏱️ Smart Video Indexing — Fast-forward directly to critical moments in the match. The backend automatically indexes key events so you can jump straight to the tactical situations that matter most (e.g. "Let's go to the next defensive break in the game").
🗣️ Natural Interruption (Barge-In) — Full support for interrupting the AI mid-sentence. Ask a follow-up question, ask for clarification on a specific player's movement, tell it to switch views, and more — all handled seamlessly by the Live API.
📸 Exportable Tactical Plans — Export your tactical board as a high-resolution image — complete with player positions, drawn correction arrows, and annotations. Share plays with your team instantly.
How we built it
Phantom Coach is built as a 4-tier system: User → Frontend → Backend → Google Cloud.
Frontend — React 19, Vite, TypeScript, Tailwind CSS 4, Zustand for state management, and Framer Motion for animations. A custom MultimodalStreamer service captures audio from the microphone and JPEG frames from the video player, streaming them over a single WebSocket connection to the backend.
Backend — A FastAPI server running on Google Cloud Run. The core is the GeminiLiveClient, which manages a bidirectional session with the Gemini 2.5 Flash Live API using the google-genai SDK. An AgentStateMachine with 4 distinct states (Live Analysis → Transition → Tactical Board → Waiting for Confirmation) gates which tools and system prompts are active at any given time, keeping the AI grounded and contextually aware.
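A minimal sketch of how such state-gated tooling can work. The state, tool, and transition names below are illustrative stand-ins, not the project's actual identifiers; the real system also swaps system prompts per state.

```python
from enum import Enum, auto

class CoachState(Enum):
    LIVE_ANALYSIS = auto()
    TRANSITION = auto()
    TACTICAL_BOARD = auto()
    AWAITING_CONFIRMATION = auto()

# Each state exposes only the tools that make sense in that context,
# so the model cannot, e.g., draw arrows while still in video mode.
ALLOWED_TOOLS = {
    CoachState.LIVE_ANALYSIS: {"describe_frame", "switch_to_board"},
    CoachState.TRANSITION: {"board_ready"},
    CoachState.TACTICAL_BOARD: {"draw_arrow", "animate_player", "switch_to_video"},
    CoachState.AWAITING_CONFIRMATION: {"confirm", "cancel"},
}

# Transitions are deterministic, keyed on (state, tool call) pairs,
# never left to the LLM's judgment.
TRANSITIONS = {
    (CoachState.LIVE_ANALYSIS, "switch_to_board"): CoachState.TRANSITION,
    (CoachState.TRANSITION, "board_ready"): CoachState.TACTICAL_BOARD,
    (CoachState.TACTICAL_BOARD, "switch_to_video"): CoachState.LIVE_ANALYSIS,
}

class AgentStateMachine:
    def __init__(self):
        self.state = CoachState.LIVE_ANALYSIS

    def tool_allowed(self, tool: str) -> bool:
        return tool in ALLOWED_TOOLS[self.state]

    def on_tool_call(self, tool: str) -> CoachState:
        # Unknown (state, tool) pairs leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, tool), self.state)
        return self.state
```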
Agent Pipeline — Three EventBus-driven agents form the computer vision backbone:
- VisionTrackingAgent — Ingests raw frames, runs YOLOv8 detection, tracks players via a ByteTrack implementation with appearance Re-ID, and calibrates positions to pitch coordinates using RANSAC homography.
- StandardizerEngine — Normalizes raw pixel coordinates to a 0–100 UI coordinate system for consistent board rendering.
- TacticalAnalysisAgent — Analyzes normalized board states using Voronoi pitch control, formation detection, and Expected Threat (xT) modeling. Emits TacticalAlert events that are injected into Gemini's context as grounding data, supplementing the model's own visual interpretation of the video stream.
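The calibration and normalization steps can be sketched roughly as follows. In the real pipeline the homography would be estimated from pitch-line correspondences with `cv2.findHomography(..., cv2.RANSAC)`; here we pass H in directly, and the function names and pitch dimensions are illustrative assumptions, not the project's actual API.

```python
import numpy as np

def pixels_to_pitch(points_px: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Map Nx2 pixel coordinates to pitch coordinates via homography H."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous coords
    mapped = (H @ pts.T).T
    return mapped[:, :2] / mapped[:, 2:3]                       # perspective divide

def standardize(points_m: np.ndarray,
                pitch_len: float = 105.0,
                pitch_wid: float = 68.0) -> np.ndarray:
    """Normalize metric pitch coordinates to the 0-100 UI coordinate system."""
    return np.clip(points_m / np.array([pitch_len, pitch_wid]) * 100.0, 0.0, 100.0)
```

With an identity homography, a point at midfield (52.5 m, 34 m) lands at (50, 50) on the board, which is the kind of invariant the board renderer relies on.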
Google Cloud — Gemini 2.5 Flash Live API for bidirectional voice + vision, Cloud Run for container hosting with auto-scaling, Firebase Authentication for user management, Cloud Firestore for project persistence, and Firebase Storage for video/frame storage.
Architecture

Technologies Used
| Category | Technologies |
|---|---|
| Frontend | React 19, Vite 7, TypeScript, Tailwind CSS 4, Zustand, Framer Motion, Lucide Icons |
| Backend | FastAPI, Python 3.11, Uvicorn |
| AI / ML | Gemini 2.5 Flash (Live API via google-genai SDK), YOLOv8 (Ultralytics), ByteTrack |
| Computer Vision | OpenCV, RANSAC Homography, Kalman Filtering, Voronoi Tessellation, DBSCAN Clustering |
| Cloud | Google Cloud Run, Firebase (Auth, Firestore, Storage), Cloud Build |
| DevOps | Docker, automated deployment via deploy.sh |
Google Cloud Services
| Service | Usage |
|---|---|
| Gemini 2.5 Flash Live API | Bidirectional streaming — real-time voice + vision analysis via google-genai SDK |
| Google Cloud Run | Container hosting with auto-scaling (1–5 instances, 4 vCPU, 4 GB RAM, always-on CPU) |
| Firebase Authentication | User sign-in and session management |
| Cloud Firestore | Persistent storage for coaching projects and match data |
| Firebase Storage | Video uploads and extracted frame storage |
| Cloud Build | Container image building, fully automated via deploy.sh |
Third-Party Integrations
- Ultralytics (YOLOv8) — Object detection, pose estimation, and segmentation models for player/ball tracking
- OpenCV — Image processing, homography computation, optical flow, and frame manipulation
- scikit-learn — DBSCAN clustering for formation line detection
- SciPy — RBF interpolation (topological mapping) and spatial distance computations
- Firebase SDK (JavaScript) — Frontend authentication and real-time data
- React / Vite / Tailwind CSS / Zustand / Framer Motion — Frontend framework, build tooling, styling, state management, and animations
Challenges we ran into
Managing bidirectional concurrency. The Gemini Live API maintains a single bidirectional WebSocket session. We're simultaneously sending audio chunks, video frames (as base64 JPEG), and tool responses — while receiving audio responses, tool calls, and text. Coordinating these concurrent streams without dropping messages or stalling the session required careful lock management and a dedicated send queue.
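The send-queue pattern can be sketched with asyncio: every producer (mic audio, frame capture, tool responses) enqueues messages, and a single sender task owns the socket write so concurrent coroutines never interleave. The `session.send(...)` call here is a stand-in for the real Live API session object, and the backpressure policy is illustrative.

```python
import asyncio
import base64
import json

class SendQueue:
    """Single-writer queue in front of a bidirectional streaming session."""

    def __init__(self, session, maxsize: int = 64):
        self.session = session
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)

    async def put_audio(self, pcm: bytes) -> None:
        await self.queue.put({"audio": base64.b64encode(pcm).decode()})

    async def put_frame(self, jpeg: bytes) -> None:
        # Under backpressure, drop the oldest frame: stale video is worthless,
        # whereas audio must never be dropped.
        if self.queue.full():
            self.queue.get_nowait()
        await self.queue.put({"frame": base64.b64encode(jpeg).decode()})

    async def sender(self) -> None:
        # The only coroutine that ever writes to the session.
        while True:
            msg = await self.queue.get()
            await self.session.send(json.dumps(msg))
```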
Grounding the LLM to prevent hallucination. While Gemini is remarkably good at natively watching and interpreting the raw video frames via the Live API, LLMs struggle to compute exact absolute coordinates or do precise spatial math by eye. Without structured data, the model might hallucinate that an offside player was actually onside. Our solution was a Hybrid Vision Architecture. While Gemini natively streams the video, a parallel EventBus-driven CV agent pipeline (YOLOv8 → ByteTrack → Homography Matrix) runs silently in the background, computing exact pitch control percentages (Voronoi), precise normalized coordinates, and Expected Threat (xT). This mathematical ground truth is injected into Gemini's context alongside the video feed, giving the model the exact numbers it needs to back up what it sees visually.
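As a rough illustration of the Voronoi pitch-control number that gets injected as grounding, a discrete approximation can be computed by assigning each cell of a sample grid to its nearest player and measuring each team's share. This numpy-only sketch (grid resolution and 0-100 coordinates are assumptions) approximates the tessellation rather than computing exact cell areas:

```python
import numpy as np

def pitch_control(home: np.ndarray, away: np.ndarray, res: int = 50) -> float:
    """Approximate the home team's Voronoi pitch-control share.

    home/away: (N, 2) player positions in the 0-100 board coordinate system.
    Each grid cell is 'owned' by the nearest player; the return value is
    the fraction of cells owned by home players.
    """
    xs, ys = np.meshgrid(np.linspace(0, 100, res), np.linspace(0, 100, res))
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)            # (res*res, 2)
    players = np.vstack([home, away])                            # (N_home + N_away, 2)
    # Distance from every grid cell to every player, then nearest-player index.
    dists = np.linalg.norm(grid[:, None, :] - players[None, :, :], axis=2)
    owner = dists.argmin(axis=1)
    return float((owner < len(home)).mean())
```

A number like `pitch_control(...) = 0.62` is the kind of precise, deterministic figure the LLM cannot reliably estimate by eye but can confidently narrate once it appears in context.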
State machine design for tool gating. Early versions let Gemini call any tool at any time, which led to chaos — drawing on the tactical board while in video mode, trying to animate players that hadn't been placed yet. The AgentStateMachine solved this by defining 4 explicit states, each with its own filtered tool set and system prompt. Transitions are deterministic and triggered by specific tool calls, not by the LLM's judgment.
Coordinate system stability. In analysis mode, the 2D board uses a video screenshot overlay so users can see exactly where players are on the actual frame. Tracking player positions accurately on top of this requires handling noisy detections as cameras pan, zoom, and cut between angles. We layered multiple stabilization techniques: a ByteTrack implementation with appearance Re-ID for consistent player identity across frames, Kalman filtering for temporal smoothing of detections, and a soft-NMS approach that prevents duplicate tracks from cluttering the board.
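The temporal smoothing step can be sketched as a minimal constant-velocity Kalman filter over one coordinate of one track. The noise values here are illustrative defaults, not the tuned production settings, and the production filter runs per-axis per-track.

```python
import numpy as np

class Kalman1D:
    """Constant-velocity Kalman filter smoothing one coordinate of a track."""

    def __init__(self, x0: float, q: float = 1e-2, r: float = 1.0):
        self.x = np.array([x0, 0.0])                  # state: [position, velocity]
        self.P = np.eye(2)                            # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])   # transition, dt = 1 frame
        self.Q = q * np.eye(2)                        # process noise
        self.H = np.array([[1.0, 0.0]])               # we observe position only
        self.R = np.array([[r]])                      # measurement noise

    def step(self, z: float) -> float:
        # Predict forward one frame.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the new (noisy) detection z.
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0])
```

Feeding jittery per-frame detections through `step` yields the stable positions the board renders, instead of players visibly vibrating between frames.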
Accomplishments we're proud of
- Zero-mouse analysis — Once the match is uploaded and playing, the entire workflow is governed by voice. You can ask the AI what it currently sees, request real-time tactical advice, fast-forward directly to indexed tactical vulnerabilities (like defensive breaks or possession loss), transition to the 2D board, have the AI draw correction arrows, simulate plays on the screen, export plays, and more — entirely hands-free.
- Real-time grounded analysis — We successfully built a system where Gemini's vast visual understanding of a live video stream is anchored by a parallel, high-speed deterministic math/CV engine. The CV pipeline tracks players and feeds tactical data back into Gemini's context fast enough that the AI's spoken analysis stays perfectly synchronized with what's actually happening on screen.
- Exportable tactical plans — Coaches can export annotated tactical board snapshots with player positions and correction arrows, making it immediately useful for team meetings and training sessions.
What we learned
Building a real-time multimodal agent is fundamentally different from building a chatbot. The interaction model isn't request/response — it's a continuous, bidirectional stream where the user, the AI, and the computer vision pipeline are all producing data simultaneously. The biggest architectural insight was that the LLM needs guardrails at the infrastructure level, not just the prompt level. The state machine, the grounding pipeline, and the tool gating all exist to constrain Gemini into producing reliable, contextual responses — and that constraint is what makes the interaction feel fluid rather than chaotic. Giving the model freedom to speak while restricting when and how it can act is what turns a powerful LLM into a trustworthy coaching tool.
What's next for Phantom Coach
- Multi-camera support for full-field coverage
- Team-specific formation libraries and playbook templates
- Integration with professional match data feeds (Opta, StatsBomb) for advanced grounding
- Mobile companion app for on-field coaching