Tomorrow, three people will die of an incurable terminal illness. The day after, three more. Amyotrophic Lateral Sclerosis is a relentless, progressive neurodegenerative disease that strips away every voluntary movement (your hands, your voice, your ability to breathe) while leaving your mind completely intact. There is no cure and no way to slow it meaningfully. Average life expectancy after diagnosis is just two to five years, and fewer than 20% of patients survive beyond five years. Today, over 32,000 Americans are living with ALS, and that number is projected to grow by 25% by 2040.

What struck us was not just the cruelty of the disease itself, but the cruelty of what patients are left with. As ALS progresses, most patients lose the ability to use their hands entirely. And yet the tools that exist to help them communicate and interact with the world cost $5,000 to $15,000. Dedicated infrared eye tracking hardware, the gold standard for ALS communication, is financially out of reach for most families. Many patients are left staring at a screen they cannot control, in a body they cannot move, unable to ask for water.

We built GazeKeys because we believe that is unacceptable, and fixable.

GazeKeys on duty

GazeKeys turns a standard laptop webcam into a full eye-tracking accessibility system. Using only their eyes, a user can:

  • Type messages using a gaze-controlled on-screen keyboard with Moorcheh-enabled semantic text prediction
  • Send emails and texts by selecting an intent and letting an AI agent draft and send on their behalf
  • Browse the web in a dedicated mode where their gaze controls the real OS cursor across the full screen
  • Complete a nightly health check-in (five questions about motor function, fatigue, mood, communication, and pain) with Gemini analyzing trends over time and delivering spoken alerts when something is wrong
  • Hear their typed sentences spoken aloud via ElevenLabs text-to-speech, giving them a voice

The system requires no special hardware. Just a webcam.

Underneath the hood

Eye tracking is handled by MediaPipe FaceMesh, which tracks 478 facial landmarks including four iris points per eye at ~30fps. We compute normalized iris offsets relative to stable anatomical anchors (the eye corner landmarks, which don't move with lid motion), subtract a calibrated neutral baseline, apply an exponential moving average filter, and integrate the velocity over time to drive a cursor. A 9-point homography calibration maps iris-space coordinates to screen-space for improved accuracy.
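The smoothing-and-integration loop described above can be sketched roughly as follows. This is a minimal illustration: the landmark geometry, gains, and dead-zone radius are our assumptions here, not the project's actual values.

```python
# Illustrative gaze-cursor pipeline: normalize iris offset by eye width,
# subtract a calibrated baseline, EMA-filter, apply a dead zone, then
# integrate the result as a velocity. All constants are assumptions.

DEAD_ZONE = 0.02   # offsets smaller than this are treated as noise
ALPHA = 0.3        # EMA smoothing factor
GAIN = 900.0       # normalized offset -> pixels per frame

class GazeCursor:
    def __init__(self, baseline, start=(640.0, 360.0)):
        self.baseline = baseline      # neutral iris offset from calibration
        self.ema = (0.0, 0.0)
        self.pos = list(start)

    def update(self, iris, inner_corner, outer_corner):
        # Normalize by eye width so the signal is roughly invariant
        # to the user's distance from the camera.
        width = outer_corner[0] - inner_corner[0]
        off_x = (iris[0] - inner_corner[0]) / width - self.baseline[0]
        off_y = (iris[1] - inner_corner[1]) / width - self.baseline[1]

        # Exponential moving average suppresses per-frame jitter.
        self.ema = (ALPHA * off_x + (1 - ALPHA) * self.ema[0],
                    ALPHA * off_y + (1 - ALPHA) * self.ema[1])

        # Dead zone: small residual offsets are noise, not intent.
        vx = self.ema[0] if abs(self.ema[0]) > DEAD_ZONE else 0.0
        vy = self.ema[1] if abs(self.ema[1]) > DEAD_ZONE else 0.0

        # Integrate velocity into cursor position.
        self.pos[0] += vx * GAIN
        self.pos[1] += vy * GAIN
        return tuple(self.pos)
```

With a centered iris (offset equal to the baseline) the cursor holds still; a sustained sideways offset accelerates it in that direction, which is why the baseline calibration matters so much.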

The keyboard and UI are rendered as transparent overlays on the OpenCV camera feed using alpha blending. The button bar, suggestion chips, and keyboard keys are all hit-tested against the cursor position each frame, with selection triggered either by dwell time or by a global hotkey listener via pynput, which keeps working even when the browser has focus in browse mode.
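A minimal sketch of dwell-based selection, assuming a fixed dwell threshold and axis-aligned button rectangles; the project's actual timings and hit-test geometry may differ:

```python
# Dwell selection sketch: a selection fires once the cursor has stayed
# inside the same button for a full dwell period. Moving to a different
# button (or off all buttons) restarts the timer.

DWELL_SECONDS = 1.0  # assumed threshold

class DwellSelector:
    def __init__(self, dwell=DWELL_SECONDS):
        self.dwell = dwell
        self.target = None
        self.entered_at = None

    def update(self, cursor, buttons, now):
        """buttons: dict name -> (x, y, w, h). Returns a name on selection."""
        hit = None
        for name, (x, y, w, h) in buttons.items():
            if x <= cursor[0] < x + w and y <= cursor[1] < y + h:
                hit = name
                break
        if hit != self.target:
            # New target (or none): restart the dwell timer.
            self.target, self.entered_at = hit, now
            return None
        if hit is not None and now - self.entered_at >= self.dwell:
            # Fire exactly once, then reset so the key doesn't repeat.
            self.target, self.entered_at = None, None
            return hit
        return None
```

Resetting after each fire is the design choice that prevents an unintended key repeat while the user's gaze lingers on the key they just selected.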

The AI agent is powered by Gemini 2.5 Flash. When a user selects an intent (email, message, search, media, emergency), the agent runs a conversation flow through a clarify overlay, asking questions like "Who should I email?" as large dwell-selectable tiles, then drafts content and executes via the Gmail API, email-to-SMS gateways, a new browser tab, or system media keys.
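The clarify flow can be pictured as a small slot-filling queue per intent: each intent carries the slots the agent must fill before drafting. The slot names and question wording below are hypothetical, for illustration only:

```python
# Hypothetical clarify-flow sketch: each intent maps to an ordered list
# of (slot, question) pairs. The overlay shows one question at a time as
# dwell-selectable tiles; once every slot is filled, the agent can draft.

CLARIFY_QUESTIONS = {
    "email":   [("recipient", "Who should I email?"),
                ("topic",     "What is the email about?")],
    "message": [("recipient", "Who should I text?")],
    "search":  [("query",     "What should I search for?")],
}

class ClarifyFlow:
    def __init__(self, intent):
        self.queue = list(CLARIFY_QUESTIONS[intent])
        self.slots = {}

    def next_question(self):
        return self.queue[0][1] if self.queue else None

    def answer(self, value):
        slot, _ = self.queue.pop(0)
        self.slots[slot] = value

    def ready(self):
        return not self.queue
```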

Phrase prediction uses Moorcheh's semantic memory engine. Every completed sentence is stored in a Moorcheh namespace. As the user types, queries fire against their personal phrase history, returning semantically similar past phrases as suggestion chips above the keyboard. The system ships with 20 seeded AAC phrases so it's useful from the first session.
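The Moorcheh SDK itself is not shown here; as an illustrative stand-in, the retrieval behavior (query the phrase history, surface the closest past phrases as chips) can be mimicked with a toy bag-of-words cosine similarity:

```python
# Stand-in for semantic phrase retrieval: rank the user's phrase history
# by similarity to the partial input and return the top matches. Moorcheh
# does this with real embeddings server-side; bag-of-words cosine is just
# a self-contained approximation for illustration.

import math
from collections import Counter

def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest(partial, history, k=3):
    """Return up to k past phrases most similar to the partial input."""
    q = _vec(partial)
    scored = sorted(history, key=lambda p: _cosine(q, _vec(p)), reverse=True)
    return [p for p in scored[:k] if _cosine(q, _vec(p)) > 0]
```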

Health monitoring stores nightly survey responses locally and in Moorcheh. After every three entries, Gemini reads the history, detects trends, and, if it finds a concern, rates the severity (1–3) and generates a warm, dynamically toned spoken alert via ElevenLabs, addressing the patient directly in their own voice.
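Gemini performs the real trend analysis; a heuristic stand-in can illustrate the shape of the check (run after every third entry, flag a sustained decline, map its size to a 1–3 severity). The metric names and 1–5 rating scale below are assumptions:

```python
# Heuristic stand-in for the nightly trend check. Entries are dicts of
# self-ratings (assumed 1-5, higher = better), oldest first. Returns a
# 1-3 severity when the last three ratings decline monotonically,
# otherwise None.

def check_trend(entries, metric):
    if len(entries) < 3 or len(entries) % 3 != 0:
        return None  # only run after every three entries
    recent = [e[metric] for e in entries[-3:]]
    if not (recent[0] > recent[1] > recent[2]):
        return None  # no sustained decline
    drop = recent[0] - recent[2]
    return min(3, drop)  # cap severity at 3
```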

Biggest complexities

Getting the system to actually work end-to-end was a gauntlet:

  • Cursor sensitivity required extensive tuning (dead zones, EMA smoothing, baseline subtraction), and it is still imperfect.
  • The Moorcheh suggestion chips took a long time to appear: the SDK returns results as {'results': [...]} rather than a flat list, so our extraction code was silently iterating over dictionary keys instead of results.
  • Browse-mode clicks required two separate pynput listeners: one suppressing the up arrow so it doesn't pass through to the browser, and one non-suppressing listener for the F9 exit hotkey.
  • The CV2 window randomly shrank after switching GPU modes on Windows, which turned out to be a DPI-awareness issue requiring a SetProcessDpiAwareness(2) call before any window creation.
  • MediaPipe's new Tasks API has almost no documented examples, so piecing together the correct LIVE_STREAM callback pattern took real digging.
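The response-shape bug is easy to reproduce: iterating a Python dict yields its keys, so the chips list silently ended up containing the string 'results' instead of phrases. (The 'text' field name below is an assumption for illustration; the {'results': [...]} wrapper is the shape the SDK actually returned.)

```python
# A response shaped like the SDK's {'results': [...]} wrapper.
response = {"results": [{"text": "I need some water"},
                        {"text": "Please call the nurse"}]}

# Buggy: iterating the dict itself yields its keys -> ["results"].
buggy = [item for item in response]

# Fixed: unwrap the 'results' list before extracting fields.
phrases = [item["text"] for item in response.get("results", [])]
```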

What we would do again

We're proud that this works on a $0 hardware budget. No IR camera, no depth sensor, no specialist equipment; just the webcam that ships with every laptop. We're also proud of the agentic architecture: the eye tracker handles intent, and Gemini handles execution. That separation keeps the system robust even with imprecise gaze: you don't need to click a 16px button, you just need to look at a 200px tile. And we're proud of the health monitoring layer, which we think is the most clinically meaningful part: a longitudinal check-in system that catches declining trends and speaks directly to the patient, not at them.

What we learned

The most surprising thing we learned was how few low-cost eye tracking solutions exist publicly. The field is dominated by $5,000–$15,000 dedicated hardware, and the gap between "research prototype" and "something a family can actually use" is enormous. We also learned that AI was essential not just as a feature, but as a research tool: without querying a broad knowledge base we would not have found MediaPipe's iris landmarks as a viable gaze-estimation approach. That discovery took minutes with AI and would have taken many hours without it. One of the things GazeKeys hopes to do is close that awareness gap by demonstrating that consumer hardware plus open models is a viable path for accessibility.

What's next for GazeKeys

The most important next step is accuracy. Webcam-based iris tracking has ~100–200px of noise, which is manageable for large targets but difficult for normal UI. We want to explore using a low-cost IR LED ring (under $10) to improve pupil contrast, and to implement a smarter gaze regression model that personalizes to each user's eye geometry over time. Beyond hardware, we want to expand the health monitoring into a proper longitudinal dashboard that caregivers and clinicians can access, and add speech-to-text input for users who still have some voice. The phrase prediction layer has real potential as a standalone AAC tool; we'd like to spin it out as an open API so other assistive-technology builders can use it.
