Inspiration

Canada is facing a skilled trades crisis. There are over 250,000 unfilled trades positions nationwide, and the gap is widening. Junior technicians are entering the workforce underprepared — handed a 200-page PDF service manual and expected to diagnose a failing furnace or AC unit in someone's home, in the middle of winter, alone. The knowledge exists in those manuals, but it's locked behind dense technical language, hard to search on a job site, and completely silent when a technician needs it most. We asked: what if a junior tech could just point their phone at the broken equipment, and have an expert walk them through the fix hands-free, step by step, out loud?

What it does

Field-Ops Vision Guide is an AI-powered repair assistant for field technicians. It turns a short video of broken HVAC equipment into a fully voiced, IKEA-style repair guide grounded in the manufacturer's service manual. The workflow has three stages:

  1. Smart Diagnosis The technician records a short video of the broken equipment — capturing both visuals and sound. The video is preprocessed through Cloudinary for optimization, then analyzed by Gemini 2.5 Flash, which reads the footage alongside the manufacturer's PDF manual. Gemini identifies the equipment, detects abnormal sounds (clicking, buzzing, rattling), spots visual faults, and produces a diagnosis with a confidence score. If it can't identify the model from the video, the user is prompted to photograph the nameplate directly.
  2. Instruction Engine Once the fault is confirmed, Gemini performs RAG (retrieval-augmented generation) directly on the service manual PDF — extracting the exact relevant section verbatim, then generating a structured set of IKEA-style repair steps. Each step contains a single physical action (10 words or fewer), a safety caution, a component category, and a visual description prompt. The steps are then passed to ElevenLabs' Flash v2.5 TTS using the "Brian" voice, generating an MP3 audio file per step that the frontend can stream.
  3. Hands-Free UI The Next.js PWA frontend presents each repair step one at a time. The technician navigates using voice commands ("next") via the Web Speech API — keeping their hands free while working. If they get stuck, they can tap "I'm Stuck" to trigger Cloudinary Visual Overlays: annotated images that highlight exactly which component to look at.
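The per-step structure described in stage 2 can be sketched as a small schema. This is a hedged illustration, not our exact response format: the field names (`action`, `caution`, `component`, `visual_prompt`) and the `validate_step` helper are made up here to show the shape and the 10-word constraint.

```python
from dataclasses import dataclass

@dataclass
class RepairStep:
    action: str         # single physical action, 10 words or fewer
    caution: str        # safety note surfaced alongside the action
    component: str      # component category, e.g. "electrical"
    visual_prompt: str  # description used to illustrate/annotate the step

def validate_step(step: RepairStep) -> bool:
    """Enforce the 10-word limit on the spoken action."""
    return 0 < len(step.action.split()) <= 10

step = RepairStep(
    action="Turn off power at the breaker",
    caution="Verify zero voltage with a multimeter first",
    component="electrical",
    visual_prompt="Arrow pointing at the main breaker switch",
)
print(validate_step(step))  # True
```

Keeping each action this terse is what makes the per-step TTS clips short enough to stream one at a time.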

How we built it

  • Backend: FastAPI (Python) serving four core routes: /api/identify, /api/triage, /api/repair, and /api/visual-assist
  • Media pipeline: Cloudinary handles video/image optimization before sending to Gemini, and serves annotated overlay images for visual assistance
  • AI: Gemini 2.5 Flash for multimodal video + audio + PDF analysis across three chained prompts
  • Frontend: Next.js PWA with the Web Speech API for hands-free voice navigation
  • Infrastructure: Uvicorn, python-dotenv, httpx, and the google-genai SDK

The Gemini pipeline uses three structured calls: equipment identification from media, fault diagnosis cross-referenced against the manual, and IKEA-style step generation, with response_mime_type: application/json and temperature: 0.0 for deterministic output.
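The three chained calls can be sketched as plain orchestration with the model injected as a callable. This is a sketch under assumptions: `call_gemini` stands in for a real google-genai generate_content request (made with response_mime_type: application/json and temperature: 0.0), and the prompts and field names are illustrative, not our actual module API.

```python
import json
from typing import Callable

def run_pipeline(call_gemini: Callable[[str], str],
                 media_ref: str, manual_ref: str) -> dict:
    """Chain the three structured Gemini calls; each returns bare JSON."""
    # 1. Equipment identification from the uploaded media
    equipment = json.loads(call_gemini(f"Identify the equipment in {media_ref}"))
    # 2. Fault diagnosis cross-referenced against the service manual
    diagnosis = json.loads(call_gemini(
        f"Diagnose {equipment['model']} using manual {manual_ref}"))
    # 3. IKEA-style step generation grounded in the manual extract
    steps = json.loads(call_gemini(
        f"Generate repair steps for fault: {diagnosis['fault']}"))
    return {"equipment": equipment, "diagnosis": diagnosis, "steps": steps}

# Stub client: deterministic JSON, as temperature 0.0 intends
def fake_gemini(prompt: str) -> str:
    if prompt.startswith("Identify"):
        return json.dumps({"model": "XR14"})
    if prompt.startswith("Diagnose"):
        return json.dumps({"fault": "failed run capacitor"})
    return json.dumps({"steps": ["Turn off power at the breaker"]})

result = run_pipeline(fake_gemini, "video.mp4", "manual.pdf")
print(result["diagnosis"]["fault"])  # failed run capacitor
```

Injecting the client callable keeps each stage testable without hitting the live API.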

Challenges we ran into

File state management. Gemini's File API processes uploads asynchronously. We learned the hard way that passing a file URI before it reaches the ACTIVE state throws a FAILED_PRECONDITION error. We added a polling loop to wait for activation before proceeding.

Manual-fault mismatch. When the equipment in the video didn't match the manual on disk, Gemini correctly returned empty repair sections rather than hallucinating. This was the system working as intended, but it pushed us to be more precise about matching media to manuals for the demo.

ElevenLabs silent failures. The voice generation pipeline was returning an empty audio map with no error output. The culprit was a bare except block silently swallowing API errors. Adding explicit status-code logging revealed a rate-limit issue, which we resolved with per-step throttling.
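The file-activation fix can be sketched as a generic polling helper. Hedged sketch: `get_state` here is a stand-in for however the upload's state is read back from the File API; the state strings mirror the PROCESSING/ACTIVE/FAILED lifecycle described above.

```python
import time

def wait_until_active(get_state, timeout_s: float = 60.0,
                      interval_s: float = 0.01) -> str:
    """Poll until an async upload leaves PROCESSING, or give up.

    get_state() returns "PROCESSING", "ACTIVE", or "FAILED". Passing the
    file URI downstream before ACTIVE is what triggered FAILED_PRECONDITION.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        time.sleep(interval_s)
    raise TimeoutError("file never reached ACTIVE state")

# Stub: the file becomes ACTIVE on the third poll
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
print(wait_until_active(lambda: next(states)))  # ACTIVE
```

A deadline plus a FAILED short-circuit avoids both infinite loops and retrying a permanently broken upload.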

Accomplishments that we're proud of

  • A fully working end-to-end pipeline from raw video → Cloudinary → Gemini vision + audio analysis → RAG on the PDF manual → voiced step-by-step repair guide
  • Gemini correctly diagnosing a noisy AC compressor from a Reddit HVAC video with high confidence, detecting the mechanical buzzing from audio alone
  • The hands-free UX concept: a technician with dirty hands in a crawl space can navigate an entire repair using only their voice
  • Clean modular architecture: cloudinary_upload.py, instr_gen.py, and voice_gen.py each own a single concern and import cleanly into main.py

What we learned

  • Gemini 2.5 Flash's multimodal capabilities are genuinely impressive for audio-visual diagnostics; it picked up mechanical fault sounds we didn't explicitly prompt for
  • response_mime_type: application/json with temperature: 0.0 is the right pattern for structured Gemini output; no fence stripping needed
  • Trades knowledge is deeply embedded in manufacturer PDFs, and Gemini handles PDF RAG well when prompted to extract verbatim rather than paraphrase
  • Designing for hands-free use forces you to think differently about UX: every interaction that requires a tap is a failure mode for a technician mid-repair
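The structured-output point above means the response body parses directly, with no markdown fence to strip first. A minimal illustration (the response string is a made-up sample, not real model output):

```python
import json

# With response_mime_type="application/json", the model returns bare JSON
# rather than a ```json fenced block, so json.loads works directly.
response_text = '{"fault": "failed run capacitor", "confidence": 0.92}'
diagnosis = json.loads(response_text)
print(diagnosis["confidence"])  # 0.92
```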

What's next for OperaAI

  • Live manual registry: map equipment model numbers to manuals automatically so any supported unit works without hardcoding
  • Cloudinary visual overlays: complete the spatial annotation pipeline; Gemini returns bounding-box coordinates and Cloudinary draws arrows on the image pointing at the exact component
  • Backboard persistent memory: store technician job history so recurring faults on the same unit surface previous diagnoses and resolutions
  • Offline mode: cache manuals and steps locally so the app works in basements and mechanical rooms with no signal
  • Expand beyond HVAC: electrical panels, water heaters, commercial refrigeration; the pipeline is equipment-agnostic

Built With

  • cloudinary
  • elevenlabs
  • fastapi
  • google-gemini
  • next.js
  • pydantic
  • python
  • uvicorn
  • web-speech-api