Inspiration
The inspiration for IRIS came from a powerful observation: despite incredible technological progress, millions of people with limited hand mobility face barriers to independent computer access. While commercial eye-tracking devices exist, they often cost thousands of dollars, pricing out the very people who need them most.
We asked ourselves: what if we could build something just as powerful using only a standard webcam and AI?
That question drove us to create IRIS, a system born from the belief that accessibility shouldn't be a luxury reserved for those who can afford specialized hardware. It should be a fundamental right, built into the foundation of modern computing. We were inspired not just by the technical challenge, but by the real human impact: giving people independence in communication, education, employment, and daily life.
What it does
IRIS transforms any computer into a fully hands-free workstation. Using only a webcam, it enables users to:
Core Control Features:
- Move the cursor by simply looking around the screen - your nose movements directly control where the cursor goes
- Click with a double blink - quick and intuitive
- Right-click with a triple blink - access context menus without hands
- Switch applications by opening your mouth for 5 seconds - activates the app switcher (Cmd+Tab on Mac)
- Activate voice commands by raising your shoulders - triggers voice input mode
Voice Command Integration: Once voice mode is activated through the shoulder raise gesture, users can:
- Issue typing commands - say anything and IRIS will type it for you
- Send emails automatically - say "email" and IRIS will:
  - Generate a professional email with subject and body using GPT-4
  - Navigate to Gmail automatically
  - Fill in all fields (recipient, subject, message)
  - Verify everything is correct using AI vision
  - Send the email with one command
The system uses Deepgram for speech transcription and GPT-4o to understand natural language and execute complex tasks.
Smart Email Assistant: Our crown jewel is the intelligent email feature. IRIS doesn't just type what you say; it understands your intent and handles the entire workflow:
- Extracts email addresses from natural speech
- Generates professional subject lines and message bodies
- Automates the entire Gmail interface
- Uses computer vision to verify fields before sending
- All from a single voice command
How we built it
IRIS combines multiple cutting-edge technologies into a cohesive system:
Computer Vision Layer:
- MediaPipe FaceMesh tracks 468 facial landmarks in real time for precise facial feature detection
- MediaPipe Pose detects shoulder movements for voice activation
- Calculates iris positions, eyelid distances, and mouth openness with millimeter precision
- Uses mathematical metrics like EAR (Eye Aspect Ratio) for blink detection and MAR (Mouth Aspect Ratio) for mouth gesture detection
- Tracks nose position as the primary cursor control mechanism
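As a rough sketch of the EAR metric described above (the landmark coordinates here are toy values, not MediaPipe's actual eye indices):

```python
import math

def _dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eye_aspect_ratio(eye):
    """EAR for six eye landmarks ordered (corner, top1, top2, corner, bot2, bot1).

    EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|); it stays roughly constant while
    the eye is open and drops sharply toward zero when the eye closes.
    """
    vertical = _dist(eye[1], eye[5]) + _dist(eye[2], eye[4])
    horizontal = 2.0 * _dist(eye[0], eye[3])
    return vertical / horizontal

# Open vs. closed eye (toy coordinates):
open_eye = [(0, 0), (1, 2), (2, 2), (3, 0), (2, -2), (1, -2)]
closed_eye = [(0, 0), (1, 0.2), (2, 0.2), (3, 0), (2, -0.2), (1, -0.2)]
print(eye_aspect_ratio(open_eye))    # high -> eye open
print(eye_aspect_ratio(closed_eye))  # near zero -> blink
```

MAR works the same way with mouth landmarks: vertical lip distance over horizontal mouth width.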
Advanced Cursor Control:
- Dynamic sensitivity adjustment - cursor moves faster with larger head movements, slower for precision work
- Exponential moving average (EMA) smoothing - eliminates jittery movements for stable control
- Dead-zone filtering - prevents micro-movements from causing unintended cursor drift
- Head-pose correction using solvePnP - maintains accuracy across different head positions and angles
- Intelligent velocity-based sensitivity that adapts to your movement speed
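For one axis, the filtering pipeline above can be sketched as follows; the alpha, dead-zone size, and gain curve here are illustrative values, not the project's tuned parameters:

```python
class CursorSmoother:
    """One-axis cursor filter: dead-zone, velocity-based gain, then EMA smoothing."""

    def __init__(self, alpha=0.3, dead_zone=2.0):
        self.alpha = alpha          # EMA weight for the newest sample
        self.dead_zone = dead_zone  # ignore movements smaller than this (px)
        self.smoothed = 0.0

    def update(self, delta):
        """delta is the raw nose displacement from baseline for this frame."""
        # Dead-zone: micro-jitter around the baseline produces no motion.
        if abs(delta) < self.dead_zone:
            delta = 0.0
        # Dynamic sensitivity: larger head moves get a higher gain,
        # capped so the cursor never becomes uncontrollable.
        gain = 1.0 + min(abs(delta) / 20.0, 2.0)
        target = delta * gain
        # EMA: blend toward the target to suppress frame-to-frame jitter.
        self.smoothed = self.alpha * target + (1 - self.alpha) * self.smoothed
        return self.smoothed

s = CursorSmoother()
print(s.update(1.0))   # inside the dead-zone -> no movement
print(s.update(30.0))  # large move -> amplified but smoothed output
```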
Gesture Recognition System:
- Sophisticated blink detection with adaptive EAR calibration that learns your natural blink patterns
- Distinguishes double blinks (left click) from triple blinks (right click)
- Time-window-based gesture recognition with cooldown periods to prevent false triggers
- Mouth-hold detection with configurable thresholds for app switching
- Shoulder elevation tracking using pose landmarks with pixel-based threshold detection
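The double-versus-triple blink distinction can be sketched with a rolling time window plus a cooldown; the window and cooldown durations below are illustrative, not the calibrated values:

```python
class BlinkClassifier:
    """Counts blinks inside a rolling time window and classifies the burst."""

    def __init__(self, window=0.8, cooldown=1.0):
        self.window = window        # seconds a burst of blinks may span
        self.cooldown = cooldown    # seconds to ignore blinks after a gesture
        self.blink_times = []
        self.last_fire = -1e9

    def on_blink(self, t):
        """Register a blink at time t (seconds); return a gesture or None."""
        if t - self.last_fire < self.cooldown:
            return None  # still cooling down from the last gesture
        self.blink_times = [b for b in self.blink_times if t - b <= self.window]
        self.blink_times.append(t)
        if len(self.blink_times) == 3:
            self.last_fire = t
            self.blink_times.clear()
            return "right_click"     # triple blink
        return None

    def flush(self, t):
        """Call once the window expires; a pending pair becomes a left click."""
        if len(self.blink_times) == 2 and t - self.blink_times[0] > self.window:
            self.blink_times.clear()
            self.last_fire = t
            return "left_click"      # double blink
        return None

c = BlinkClassifier()
c.on_blink(0.0); c.on_blink(0.2)
print(c.on_blink(0.4))  # third blink inside the window -> "right_click"
```

Waiting for the window to expire before firing a left click is what keeps a double blink from being misread as the start of a triple.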
Browser Automation Engine:
- JavaScript injection into Chrome via AppleScript for direct DOM manipulation
- Hardcoded Gmail workflow that navigates the interface programmatically
- Smart element selection using multiple CSS selector fallbacks
- Field verification using GPT-4 Vision - takes screenshots and validates form completion before sending
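The injection step can be sketched with Chrome's AppleScript `execute ... javascript` command driven through `osascript`. This is a minimal sketch, not the project's actual Gmail selectors; it requires macOS with "Allow JavaScript from Apple Events" enabled in Chrome's Developer menu:

```python
import subprocess

def chrome_js(js_code, run=False):
    """Build (and optionally run) an AppleScript that injects JavaScript
    into Chrome's frontmost active tab."""
    # Escape backslashes and quotes so the JS survives AppleScript quoting.
    escaped = js_code.replace("\\", "\\\\").replace('"', '\\"')
    script = (
        'tell application "Google Chrome" to '
        'execute front window\'s active tab javascript "%s"' % escaped
    )
    if run:  # only meaningful on macOS with Chrome running
        return subprocess.run(["osascript", "-e", script],
                              capture_output=True, text=True).stdout.strip()
    return script

# Build (without running) a script that clicks the first matching element:
print(chrome_js('document.querySelector("a").click()'))
```

Returning a value from the injected JavaScript is what lets the caller confirm the DOM action actually completed before moving on.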
AI Voice Control Pipeline:
- PyAudio captures real-time audio input during the shoulder raise
- Deepgram's Nova-2 model transcribes speech with high accuracy
- GPT-4o-mini generates professional email content from natural language hints
- Regex-based parsing extracts email addresses and message topics from spoken commands
- OpenAI GPT-4o with vision capabilities verifies email composition before sending
System Integration:
- PyAutoGUI simulates OS-level mouse movements, clicks, and keyboard input
- Global input locking prevents gesture conflicts during voice processing
- Concurrent audio recording managed with a frame-based buffer system
- Multi-threaded architecture handles vision processing, audio capture, and system control simultaneously
Calibration System:
- 60-frame initial calibration establishes baseline nose position and shoulder height
- User-specific threshold adaptation for accessibility across different mobility levels
- Real-time visual feedback during the calibration process
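The baseline step can be sketched as an average over the first N frames; in the real system the same pattern also calibrates shoulder height and the EAR baseline:

```python
class Calibrator:
    """Averages the first N frames of a landmark to establish a per-user baseline."""

    def __init__(self, frames=60):
        self.frames = frames
        self.samples = []
        self.baseline = None

    def feed(self, nose_xy):
        """Feed one (x, y) nose position per frame; True once calibrated."""
        if self.baseline is not None:
            return True
        self.samples.append(nose_xy)
        if len(self.samples) == self.frames:
            xs = [p[0] for p in self.samples]
            ys = [p[1] for p in self.samples]
            self.baseline = (sum(xs) / len(xs), sum(ys) / len(ys))
            return True
        return False

cal = Calibrator(frames=3)  # tiny frame count just for the demo
for p in [(10, 20), (12, 22), (14, 24)]:
    done = cal.feed(p)
print(done, cal.baseline)
```

Averaging rather than taking a single frame makes the baseline robust to the small jitter present even when the user holds still.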
The architecture is modular with separate classes for cursor smoothing, dead-zone handling, coordinate mapping, and audio recording, making the system maintainable and extensible.
Challenges we ran into
Building IRIS pushed us through several significant technical hurdles:
Nose-Based Cursor Tracking: Using the nose tip as a cursor control point was more challenging than eye-gaze tracking because head movements are larger and more dynamic. We developed a multi-layered smoothing system combining EMA filtering, dead-zone thresholds, and velocity-based dynamic sensitivity to achieve stable, responsive cursor control without specialized hardware.
Multi-Modal Gesture Coordination: Simultaneously tracking facial landmarks (468 points) and body pose (33 landmarks) in real-time required careful optimization. We had to balance processing speed with accuracy, ensuring MediaPipe could handle both models concurrently without frame drops or latency issues.
Blink vs. Natural Eye Movement: Early versions frequently mistook normal blinks for commands, causing chaos with unintended clicks. We implemented adaptive EAR calibration that learns each user's natural blink baseline during the 60-frame calibration period, plus time-window analysis to distinguish intentional double/triple blinks from random eye closures.
Mouth Hold Duration Tuning: Finding the right threshold for mouth-open app switching (5 seconds) required extensive testing. Too short and normal speech would trigger it; too long and users with limited facial mobility would struggle. We settled on 5 seconds with continuous app-cycling capability once activated.
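The hold-duration logic above amounts to a resettable timer over the MAR signal; the MAR threshold below is an illustrative value, while the 5-second hold mirrors the figure described above:

```python
class MouthHoldDetector:
    """Fires once the mouth has stayed open continuously for hold_s seconds."""

    def __init__(self, hold_s=5.0, mar_threshold=0.6):
        self.hold_s = hold_s
        self.mar_threshold = mar_threshold
        self.open_since = None

    def update(self, mar, t):
        """Feed one MAR sample at time t (seconds); True when the hold completes."""
        if mar < self.mar_threshold:
            self.open_since = None   # mouth closed: reset the timer
            return False
        if self.open_since is None:
            self.open_since = t      # mouth just opened: start timing
        return (t - self.open_since) >= self.hold_s

d = MouthHoldDetector()
print(d.update(0.8, 0.0), d.update(0.8, 5.0))  # False True
```

Resetting on any closure is what lets ordinary speech, which opens and closes the mouth rapidly, pass through without triggering the app switcher.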
Shoulder Raise Detection: Shoulder detection proved highly variable across users. Body size, posture, and camera angle all affected baseline shoulder position. We developed a pixel-based threshold system (15 pixels of vertical movement) that calibrates to each individual during setup, with adaptive thresholds for users with limited shoulder mobility.
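In image coordinates, the pixel-threshold check reduces to a single comparison (the 15 px default mirrors the figure above; the per-user adaptive calibration is omitted here):

```python
def shoulder_raised(current_y, baseline_y, threshold_px=15):
    """True when the shoulder landmark sits at least threshold_px above its
    calibrated baseline. Image y grows downward, so "raised" means smaller y."""
    return (baseline_y - current_y) >= threshold_px

print(shoulder_raised(100, baseline_y=120))  # 20 px above baseline -> True
print(shoulder_raised(112, baseline_y=120))  # only 8 px -> False
```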
Browser Automation Reliability: Gmail's dynamic DOM structure changes frequently, making element selection fragile. We implemented multiple CSS selector fallbacks, JavaScript injection via AppleScript for cross-process communication, and retry logic. The verification step using GPT-4 Vision acts as a safety net, ensuring fields are filled before sending.
Email Address Extraction: Parsing natural speech for email addresses required robust regex patterns that handle various speech-to-text quirks. Users don't speak in structured commands, so we had to handle phrases like "email John at [email protected] saying hello" with flexible pattern matching.
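A simplified sketch of that flexible matching: spoken forms like "at" and "dot" are normalized before a standard email regex runs. This toy version sidesteps the harder ambiguity (a name like "John at ..." versus the address itself) that the real parsing had to handle:

```python
import re

# Speech-to-text often renders "jane.doe at gmail dot com" literally,
# so normalize the spoken connectives before matching.
SPOKEN_FORMS = [(r"\s+at\s+", "@"), (r"\s+dot\s+", ".")]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def extract_email(utterance):
    """Return the first email address found in a transcribed utterance, or None."""
    text = utterance.lower()
    for pattern, repl in SPOKEN_FORMS:
        text = re.sub(pattern, repl, text)
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None

print(extract_email("send to jane.doe at gmail dot com saying hello"))
# -> jane.doe@gmail.com
```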
Voice Command Processing Latency: Managing concurrent audio recording, Deepgram transcription, GPT inference, Gmail navigation, and field filling without blocking the main cursor control loop required careful threading and input locking. We implemented a global lock that disables gesture processing during voice commands to prevent interference.
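The global-lock pattern can be sketched with a shared flag that the gesture loop checks before acting; this is a minimal sketch of the pattern, with illustrative names rather than the project's own:

```python
import threading

class GestureGate:
    """Suspends gesture handling while a voice command is being processed."""

    def __init__(self):
        self._voice_active = threading.Event()

    def start_voice(self):
        self._voice_active.set()     # called when the shoulder raise begins

    def end_voice(self):
        self._voice_active.clear()   # called after the command finishes

    def handle_gesture(self, action):
        """Gesture loop calls this; gestures are dropped during voice mode."""
        if self._voice_active.is_set():
            return None  # prevent blinks/mouth-holds from firing mid-command
        return action

gate = GestureGate()
print(gate.handle_gesture("left_click"))  # passes through
gate.start_voice()
print(gate.handle_gesture("left_click"))  # suppressed during voice mode
```

Using a `threading.Event` keeps the check safe across the vision and audio threads without an explicit mutex around every gesture.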
AI Verification Timing: Taking screenshots at the right moment, after all fields are filled but before sending, required precise timing coordination between JavaScript execution, DOM updates, and PyAutoGUI actions. We added strategic sleep delays and used JavaScript return values to confirm action completion.
Accomplishments that we're proud of
We built a fully functional accessibility tool using only a webcam and AI: no expensive specialized hardware required. Through the Python workshop, we learned how to use loops and if statements, which we applied throughout our code.
Technical Achievements:
- Nose-based cursor control that rivals commercial eye trackers in accuracy and responsiveness
- Intelligent email automation that understands natural language and executes complex multi-step workflows autonomously
- Multi-modal gesture recognition combining face, pose, and voice inputs seamlessly
- AI-powered verification that ensures reliability using computer vision
Real-World Impact: What truly matters is the impact. This tool can help people with physical disabilities regain independence in their digital lives. The email feature alone could be transformative: someone who previously needed assistance for every email can now compose and send messages independently through simple voice commands.
Innovation Beyond Proof-of-Concept: We didn't just demonstrate that voice-controlled email is possible; we built a production-ready system that generates professional content, navigates real web interfaces, verifies accuracy, and handles edge cases. The integration of GPT-4 for content generation with GPT-4 Vision for verification shows our commitment to reliability.
Accessibility-First Design: Every gesture was carefully chosen for users with limited mobility. Nose movements require less fine motor control than eye gaze. Shoulder raises are accessible to many users who can't use hand gestures. The 5-second mouth hold is intentional enough to avoid false triggers but achievable for users with facial limitations.
What we learned
Workshop:
Through the Intro to Python workshop, we learned how to use loops and if statements, skills we applied directly to our project. In addition, the Intro to AI workshop taught us how to work with API keys, which enabled the AI features within our system.
This project became a masterclass in multiple domains:
Computer Vision Mastery:
- Real-time facial landmark and pose tracking at 30+ fps
- Geometric feature extraction from 468 facial points and 33 body landmarks
- EAR and MAR calculation for reliable gesture detection
- Head-pose estimation using solvePnP for 3D orientation tracking
- Multi-layer smoothing techniques (EMA, dead-zone, velocity-based)
Browser Automation Expertise:
- JavaScript injection across process boundaries using AppleScript
- DOM manipulation and element selection strategies with fallbacks
- Web application reverse-engineering (Gmail's interface structure)
- Screenshot timing and synchronization with asynchronous DOM updates
AI Integration:
- Speech-to-text with Deepgram's Nova-2 model
- Prompt engineering for GPT-4o to generate structured content (subject + body JSON)
- Multimodal AI with GPT-4 Vision for screenshot verification
- Natural language understanding for command parsing and intent extraction
Python Architecture:
- Modular class-based design for complex real-time systems
- Concurrent processing: video, audio, AI inference, and system control
- Resource management: camera, audio streams, PyAutoGUI
- Error handling and graceful degradation when dependencies are missing
Accessibility Design Philosophy: We learned that accessibility isn't about adding features; it's about rethinking fundamental interaction paradigms. Small decisions have enormous impact:
- Gesture thresholds too tight exclude users with tremors
- Delays too short cause false triggers
- Workflows too complex become unusable under cognitive load
Human-Centered Testing: Every threshold, timing parameter, and gesture choice reflects hours of testing and iteration. We learned to design for the margins: if it works for users with severe limitations, it works for everyone.
Prompt Engineering for Structured Output: Getting GPT to reliably return JSON with specific keys required careful prompt design, including explicit format examples and instruction emphasis. We learned to parse responses defensively, handling cases where the model adds preambles or markdown formatting.
Most importantly, we learned how to blend engineering excellence, AI capabilities, and genuine empathy into technology that serves humanity.
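Defensive parsing of that kind can be sketched as follows; the `"subject"`/`"body"` key names are an assumption matching the email schema described earlier:

```python
import json
import re

def parse_model_json(raw):
    """Defensively extract a JSON object with "subject" and "body" keys from a
    model reply that may include preamble text or markdown code fences."""
    # Strip markdown fences (``` or ```json) if the model added them.
    cleaned = re.sub(r"```(?:json)?", "", raw)
    # Grab the outermost {...} span, ignoring any preamble or trailing chatter.
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))
    if not {"subject", "body"} <= data.keys():
        raise ValueError("missing required keys")
    return data

reply = 'Sure! Here is the email:\n```json\n{"subject": "Hello", "body": "Hi there."}\n```'
print(parse_model_json(reply))
```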
What's next for IRIS
IRIS is just the beginning. Our roadmap includes:
Near-Term Improvements:
- Customizable user profiles - save individual calibration settings, gesture thresholds, and sensitivity preferences
- Visual calibration dashboard - GUI for adjusting parameters without editing code
- Expanded email templates - support for common message types (thank you, follow-up, meeting requests)
- Additional shortcuts - automate other workflows like calendar entry, document creation, web searches
- Dwell-to-click option - hold gaze on elements to click without blinking
- ML-powered gesture classification - train models on user data for personalized, robust detection
Platform Expansion:
- Cross-platform browser automation - Windows/Linux support using Selenium or Playwright
- Multi-browser support - extend beyond Chrome to Firefox, Safari, Edge
- Mobile version - adapt for iOS/Android using front-facing cameras
- Web-based interface - browser extension for universal web accessibility
Advanced Features:
- Smart-home control - extend voice commands to IoT devices (lights, thermostats, locks)
- Application-specific shortcuts - custom gestures for Photoshop, Word, Slack, etc.
- Collaborative filtering - share and discover shortcuts created by the community
- Real-time translation - speak in any language, send emails in the recipient's language
- Accessibility analytics - track usage patterns to optimize gesture assignments
AI Enhancements:
- Contextual email generation - reference previous conversations and adapt tone automatically
- Multi-step task planning - "book a flight to NYC next Tuesday" handled end-to-end
- Proactive assistance - suggest actions based on screen content and user habits
- Adaptive learning - system improves accuracy by learning from corrections
Research Directions:
- Minimal-movement gestures - support users with extremely limited mobility
- Thought-typing - explore integration with emerging BCI technologies
- Emotion detection - adjust system behavior based on user frustration or fatigue
- Multi-user support - quick profile switching for shared devices
Our ultimate mission is making digital independence universally accessible. Every person deserves the ability to interact with technology on their own terms: to send emails, browse the web, create content, and participate fully in the digital world, regardless of physical ability. IRIS proves that powerful accessibility tools don't require expensive hardware or proprietary systems; they require creative thinking, technical skill, and genuine care for the people they serve. We're committed to building that future, one gesture, one command, one empowered user at a time.