Inspiration

Shopping independently is still a major challenge for visually impaired individuals. Most existing solutions rely on expensive hardware or provide limited functionality. We wanted to build something simple, accessible, and practical using just a camera and a microphone, so that anyone can navigate and understand products in real-world environments.

What it does

WhereDaMilk is a real-time assistive vision tool that helps visually impaired users find items, read labels, identify objects, and get detailed product information, all through voice commands. It has four modes: FIND (locate items using two-stage YOLO + OCR matching), WHAT (identify and announce objects in view), READ (extract and speak text from labels), and DETAILS (get in-depth product analysis powered by Google Gemini Vision API). Users speak naturally, saying things like "find milk", "read", or "tell me more", and the system responds with spoken guidance.
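The voice-command dispatch described above could be sketched roughly like this. The keyword lists, function name, and return convention are illustrative assumptions, not the project's actual code:

```python
# Hypothetical sketch: map a spoken phrase to one of the four modes.
# Keyword lists and names are illustrative, not WhereDaMilk's real code.
MODE_KEYWORDS = {
    "FIND":    ["find", "where is", "locate"],
    "WHAT":    ["what", "identify"],
    "READ":    ["read"],
    "DETAILS": ["tell me more", "details"],
}

def parse_command(utterance: str):
    """Return (mode, target) for a spoken phrase, or (None, None)."""
    text = utterance.lower().strip()
    for mode, keywords in MODE_KEYWORDS.items():
        for kw in keywords:
            if text.startswith(kw):
                # Anything after the keyword is the target, e.g. "milk"
                return mode, text[len(kw):].strip()
    return None, None
```

With this sketch, "find milk" would dispatch to FIND with target "milk", while "read" and "tell me more" would select READ and DETAILS with no target.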

How we built it

We built WhereDaMilk as a modular Python system with a real-time vision pipeline. YOLOv8n handles object detection at 640x480 via OpenCV. EasyOCR provides text recognition for labels and packaging. Google Gemini Vision API powers the DETAILS mode for deep product analysis (brand, ingredients, nutritional info). SpeechRecognition runs on a background thread for continuous mic input, and ElevenLabs with edge-tts as a fallback handles voice output. The architecture uses a mode handler pattern where each mode (Find, What, Read, Details) is encapsulated in its own handler class with a consistent start/process/reset interface. FIND mode uses a two-stage matching system that first checks YOLO object classes and falls back to OCR text matching, with IoU-based single-target tracking to prevent repeated announcements.
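The mode handler pattern with a consistent start/process/reset interface might look something like the following. This is a minimal sketch; the `ModeHandler` base class, the toy `ReadHandler`, and the stubbed OCR callable are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class ModeHandler(ABC):
    """Sketch of the shared start/process/reset interface each mode implements."""

    @abstractmethod
    def start(self, target=None):
        """Activate the mode, optionally with a spoken target (e.g. 'milk')."""

    @abstractmethod
    def process(self, frame):
        """Handle one camera frame; return an announcement string or None."""

    @abstractmethod
    def reset(self):
        """Clear per-mode state when the user switches modes."""

class ReadHandler(ModeHandler):
    """Toy READ mode with the OCR call stubbed out for illustration."""
    def __init__(self, ocr=lambda frame: frame.get("text", "")):
        self.ocr = ocr          # the real system would wrap EasyOCR here
        self.active = False

    def start(self, target=None):
        self.active = True

    def process(self, frame):
        if not self.active:
            return None
        text = self.ocr(frame)
        return f"Label says: {text}" if text else None

    def reset(self):
        self.active = False
```

Encapsulating each mode behind the same three methods lets the main loop treat Find, What, Read, and Details interchangeably.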

Challenges we ran into

Object mismatch was a core problem. Users say "milk" but YOLO detects "bottle", so we built a two-stage matching system that tries YOLO class matching first, then falls back to OCR text extraction and keyword matching. Repeated audio spam from continuous detections was distracting, so we implemented stateful IoU tracking that announces a found object once and then tracks it silently. Balancing real-time performance with accuracy required choosing lightweight models (YOLOv8n, MiDaS) and running speech recognition on a separate thread to avoid blocking the vision loop. We also switched from PaddleOCR to EasyOCR after running into reliability issues, which significantly improved text recognition consistency.
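The two-stage fallback could be sketched as below. The function name, input shapes, and substring matching are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of the two-stage match: YOLO class labels first,
# then OCR'd label text as a fallback. Data shapes are assumed.
def match_target(target, yolo_detections, ocr_results):
    """yolo_detections: [(class_name, bbox)]; ocr_results: [(text, bbox)].
    Returns the matched bounding box, or None."""
    target = target.lower()
    # Stage 1: match against YOLO's class labels (e.g. "cup", "bottle")
    for cls, bbox in yolo_detections:
        if target in cls.lower():
            return bbox
    # Stage 2: fall back to keyword matching on recognized label text,
    # which catches branded items whose class label is too generic
    for text, bbox in ocr_results:
        if target in text.lower():
            return bbox
    return None
```

So a request for "bottle" resolves at stage 1, while "milk" misses the class labels and resolves at stage 2 via the text on the carton.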

Accomplishments that we're proud of

  • Built a fully working real-time assistive system with four distinct modes.
  • Designed a clean modular architecture with separate mode handler classes, reducing main.py from 450 lines to 200 lines during the refactor.
  • Created a robust fallback chain for TTS (ElevenLabs to edge-tts to system audio) so the app works with or without API keys.
  • Implemented two-stage object matching (YOLO + OCR) that handles both common objects and branded/labeled products.
  • Enabled fully hands-free interaction with continuous background voice recognition.
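A fallback chain like the one described (ElevenLabs, then edge-tts, then system audio) can be sketched generically. The `speak` helper and the backend callables here are illustrative, not the project's real API wrappers:

```python
import logging

def speak(text, backends):
    """Try each TTS backend in order, falling through on any failure.
    `backends` is an ordered list of callables, e.g. an ElevenLabs wrapper
    first, then edge-tts, then system audio (names are illustrative)."""
    for backend in backends:
        try:
            backend(text)
            return True
        except Exception as exc:    # missing API key, network error, ...
            logging.warning("TTS backend %s failed: %s", backend.__name__, exc)
    return False
```

Because each backend is just a callable, the chain degrades gracefully when no API keys are configured.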

What we learned

  • Accessibility design means reducing cognitive load, not adding features: one clear announcement is better than repeated alerts.
  • Real-time AI systems need strong state management to coordinate vision, speech, and audio threads without conflicts.
  • Object detection alone is not enough for real-world product identification; text recognition is essential for branded items where the class label ("bottle") does not match what the user is looking for ("Coca-Cola").
  • Practical tool selection matters: switching from PaddleOCR to EasyOCR solved real reliability problems.
  • Voice interaction design is critical: throttling, single announcements, and clear mode transitions make the difference between usable and frustrating.
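The announce-once behavior built on IoU tracking could be sketched as follows. The tracker class, threshold value, and box format are assumptions for illustration:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

class SingleTargetTracker:
    """Announce a found object once, then track it silently via IoU overlap."""
    def __init__(self, iou_threshold=0.3):   # threshold is an assumed value
        self.iou_threshold = iou_threshold
        self.last_box = None

    def update(self, box):
        """Return True only when the box looks like a new target."""
        if self.last_box is not None and iou(box, self.last_box) >= self.iou_threshold:
            self.last_box = box   # same object moved slightly: stay silent
            return False
        self.last_box = box       # new or relocated object: announce it
        return True
```

A detection that overlaps the previously seen box is treated as the same object and suppressed, which is what turns a stream of per-frame detections into a single spoken announcement.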

What's next for WhereDaMilk?

  • Build a mobile app version for iOS and Android.
  • Integrate MiDaS depth estimation for distance-aware guidance ("milk is about 2 feet to your left").
  • Train custom models for retail-specific products beyond YOLO's 80 default classes.
  • Add multilingual voice support for both commands and TTS output.

Built With

  • easyocr
  • edge-tts
  • elevenlabs
  • flask
  • google-gemini-api
  • opencv
  • opencv-python
  • python
  • python-dotenv
  • speechrecognition
  • timm
  • torch
  • transformers
  • ultralytics
  • ultralytics-yolov8