Inspiration

There are 285 million visually impaired people in the world. Over 75 million people live with motor disabilities severe enough to prevent them from using a keyboard or mouse. For millions more, conditions like ALS, Parkinson's, cerebral palsy, or spinal cord injuries turn a computer — the most powerful tool of our time — into something just out of reach.

Every voice assistant that exists today answers questions. None of them can actually do things on a computer. Siri tells you the weather. Bixby sets a timer. But none of them can open your email, read it to you, draft a reply, and send it. None of them can navigate a government website for someone who cannot see it. None of them can be your hands.

That gap is what SUVI exists to close.

What it does

SUVI is an AI agent that sees your screen, hears your voice, and takes action — on any application, without any plugins, browser extensions, or API access. It works the same way a human assistant would: by looking at the screen and figuring out what to do.

You talk to SUVI the way you would talk to a person sitting next to you:

  • "Open my email and read me the latest message from my doctor"
  • "Book a train ticket to Mumbai for next Friday"
  • "Open VS Code and write a function to sort a list"
  • "Stop" — and it stops, immediately, mid-task

For motor-impaired users, SUVI is their hands — full computer control through voice alone, no mouse, no keyboard required.

For non-speaking users, SUVI accepts typed text and executes it the same way — the interface adapts to the person, not the other way around.

How we built it

SUVI is built as a four-layer system, each layer with a clear responsibility.

The local desktop client is a PyQt6 application — a single Python process that handles everything on the user's machine: the login and settings window, the floating chat widget, microphone capture via sounddevice, screen capture via python-mss, and action execution via PyAutoGUI and Playwright. The decision to use PyQt6 instead of Electron was deliberate — it eliminates the IPC layer between Node.js and Python entirely, reducing action latency from ~2ms per call to ~0.001ms and removing an entire category of failure modes.

The Gemini Live API powers the voice layer. A persistent WebSocket session streams microphone audio in real time and receives TTS audio back. The session stays open across the entire interaction, making the experience feel like a genuine conversation rather than a sequence of requests.

The computer vision loop uses Gemini's computer use capability. When the user gives a task, SUVI captures a screenshot and sends it to the model along with the intent. The model returns a precise action — click at these coordinates, type this text, press this key. SUVI executes it, captures another screenshot, sends it back, and the loop continues until the task is complete or the user says stop.
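The loop's shape can be sketched in a few lines of dependency-injected Python. This is a minimal stand-in, not SUVI's actual code: the model and executor are stubbed so the control flow (screenshot → action → execute → repeat, with a stop condition) is visible on its own.

```python
def run_task(intent, capture_screen, query_model, execute, max_steps=20):
    """Observe-act loop: screenshot -> model -> action, until done or stopped.

    capture_screen() -> bytes, query_model(intent, frame) -> action dict,
    execute(action) performs the click/type/keypress. All three are injected,
    so the loop itself stays independent of any particular backend.
    """
    for _ in range(max_steps):
        frame = capture_screen()
        action = query_model(intent, frame)
        if action["type"] in ("done", "stop"):
            return action["type"]
        execute(action)  # e.g. {"type": "click", "x": 412, "y": 87}
    return "step_limit"

# Demo with a scripted fake model: one click, then task completion.
script = iter([{"type": "click", "x": 412, "y": 87}, {"type": "done"}])
executed = []
result = run_task(
    "open email",
    capture_screen=lambda: b"\x89PNG...",        # stand-in for an mss grab
    query_model=lambda intent, frame: next(script),
    execute=executed.append,
)
print(result, executed)
```

In the real client, `capture_screen` wraps python-mss, `query_model` calls Gemini's computer use endpoint, and `execute` dispatches to PyAutoGUI or Playwright; the "stop" branch is what lets a spoken "stop" interrupt mid-task.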

The cloud layer runs on Google Cloud. A FastAPI gateway on Cloud Run handles authentication via Firebase, manages WebSocket sessions, and acts as the bridge between the desktop client and Vertex AI. The SUVI Orchestrator agent, built with Google ADK, runs on Vertex AI Agent Engine and handles complex task planning. Firestore stores session memory and user profiles. Every desktop action is logged to Cloud Logging and telemetry flows through Pub/Sub into BigQuery.
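The gateway's session bookkeeping can be sketched with the standard library alone. This is a hypothetical simplification: the real gateway verifies Firebase ID tokens and persists to Firestore, whereas this sketch uses random tokens and an in-memory dict just to show the authenticate-and-reap pattern.

```python
import secrets
import time

class SessionRegistry:
    """Minimal in-memory session table for a WebSocket gateway: maps a
    session token to a user id and last-seen timestamp, so the handler
    can authenticate incoming frames and reap idle connections."""

    def __init__(self, idle_timeout=300.0):
        self.idle_timeout = idle_timeout
        self._sessions = {}

    def open(self, user_id):
        """Create a session after upstream auth succeeds; return its token."""
        token = secrets.token_urlsafe(16)
        self._sessions[token] = {"user": user_id, "seen": time.monotonic()}
        return token

    def touch(self, token):
        """Validate a token on each frame; refresh its last-seen time."""
        session = self._sessions.get(token)
        if session is None:
            return None  # unknown or already-reaped token
        session["seen"] = time.monotonic()
        return session["user"]

    def reap_idle(self):
        """Drop sessions idle longer than the timeout; return how many."""
        now = time.monotonic()
        dead = [t for t, s in self._sessions.items()
                if now - s["seen"] > self.idle_timeout]
        for t in dead:
            del self._sessions[t]
        return len(dead)
```

A periodic `reap_idle()` matters on Cloud Run in particular, where instances scale to zero and stale sessions would otherwise accumulate state that no client will ever reclaim.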

The entire project was built solo in under a week, using Gemini CLI as the primary development tool — which felt appropriate given what we were building.

Challenges we ran into

The IPC problem. The original design used Electron for the UI and Python for execution — two separate processes. Every action crossed a process boundary with JSON serialization, socket overhead, and duplication of screen frames in two separate heaps. Switching to PyQt6 mid-build was a significant architectural decision but the right one.
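The cost difference can be modeled in miniature. The sketch below is illustrative only: it compares a direct in-process call against the same call paying JSON serialization both ways, which is the cheapest part of what every Electron-to-Python round trip cost (real IPC adds socket transport and frame copies on top). Absolute numbers are machine-dependent.

```python
import json
import timeit

def execute(action):
    # Stand-in for a PyAutoGUI call; we are measuring the dispatch path.
    return action["x"] + action["y"]

def in_process(action):
    # PyQt6 design: the UI hands the dict straight to the executor.
    return execute(action)

def across_ipc(action):
    # Electron-era design, modeled minimally: JSON encode/decode on the
    # way out and back, as any process-boundary crossing must pay.
    wire = json.dumps(action)
    result = execute(json.loads(wire))
    return json.loads(json.dumps(result))

action = {"type": "click", "x": 412, "y": 87}
n = 10_000
direct = timeit.timeit(lambda: in_process(action), number=n) / n
ipc = timeit.timeit(lambda: across_ipc(action), number=n) / n
print(f"direct: {direct * 1e6:.2f}us  ipc-modeled: {ipc * 1e6:.2f}us")
```

Both paths produce identical results; only the overhead differs, and in a loop that executes an action per screenshot that overhead compounds on every step.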

Gemini Live API WebSocket stability. The Live API connection would drop silently with error code 1008 and no close frame. Debugging this took longer than expected — the root cause was a combination of the wrong model string, tool declarations being passed alongside AUDIO-only response modalities which the API does not support, and Cloud Run dropping idle WebSocket connections before a keepalive loop was implemented.
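The keepalive fix is a small asyncio loop. This is a sketch under stated assumptions: `send_ping` stands in for whatever heartbeat the real client sends over the Live API socket, and the interval would be tuned below Cloud Run's idle cutoff.

```python
import asyncio

async def keepalive(send_ping, interval=20.0, stop=None):
    """Ping the upstream socket on a fixed interval so the platform does
    not drop the connection as idle. send_ping is an async callable; stop
    is an asyncio.Event that ends the loop promptly on shutdown."""
    stop = stop or asyncio.Event()
    while not stop.is_set():
        await send_ping()
        try:
            # Sleep for the interval, but wake immediately if stopped.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed with no shutdown: ping again

# Demo with a counted stub instead of a live socket.
async def demo():
    pings = []
    stop = asyncio.Event()

    async def fake_ping():
        pings.append(1)
        if len(pings) >= 3:
            stop.set()

    await keepalive(fake_ping, interval=0.01, stop=stop)
    return len(pings)

sent = asyncio.run(demo())
print(sent)  # 3
```

Waiting on the stop event instead of a bare `asyncio.sleep` is what makes "stop" take effect immediately rather than at the next tick, which matches the interrupt-anytime behavior the rest of SUVI promises.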

Intent extraction from native audio models. The Live API model reasons out loud before acting — it outputs multiple turns of internal thought before reaching a conclusion. Building a reliable parser that fires on the right turn with the right intent, without triggering on the model's own reasoning text, required accumulating the full turn buffer and only parsing on the final complete turn.
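The buffering strategy looks roughly like this. The chunk format and the `ACTION:` directive are simplified stand-ins for the Live API's actual server messages, chosen only to show the rule: buffer every chunk, parse nothing until the turn is complete, and act only on the final turn's text.

```python
def extract_intent(chunks):
    """Accumulate streamed text chunks for one model turn and parse an
    intent only when the turn is marked complete. Chunks are
    (text, turn_complete) pairs; intermediate reasoning text is
    buffered, inspected once whole, and discarded if it carries no
    directive."""
    buffer = []
    for text, turn_complete in chunks:
        buffer.append(text)
        if not turn_complete:
            continue  # still mid-turn: keep buffering
        full_turn = "".join(buffer)
        buffer.clear()
        # Only fire on an explicit directive in a *complete* turn.
        if full_turn.strip().startswith("ACTION:"):
            return full_turn.split("ACTION:", 1)[1].strip()
    return None

stream = [
    ("The user wants email. ", False),      # partial chunk: buffered
    ("I should open the mail app.", True),  # reasoning turn: ignored
    ("ACTION: open_email", True),           # final turn: parsed
]
print(extract_intent(stream))  # open_email
```

Parsing per-chunk instead of per-turn was the original bug: a directive split across two chunks, or a directive-like phrase inside the model's reasoning, would fire the executor at the wrong moment.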

Cloud Run deployment from a monorepo. Deploying from a GitHub repository where the gateway lives in a subdirectory meant every import path needed to be relative to the build context, not the repository root. This caused several failed deployments before the correct Dockerfile and import structure were established.
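The eventual fix amounts to a Dockerfile shaped like the fragment below. Directory and module names here are illustrative, not the project's actual layout; the point is that every `COPY` and every import resolves against the build context (the repo root), not the gateway subdirectory.

```dockerfile
# Build context is the repo root; the gateway lives in a subdirectory.
FROM python:3.11-slim
WORKDIR /app
COPY gateway/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the whole package so package-style imports (gateway.main, etc.)
# resolve the same way in the container as in the build context.
COPY gateway/ ./gateway/
CMD ["uvicorn", "gateway.main:app", "--host", "0.0.0.0", "--port", "8080"]
```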

Google Cloud project misconfiguration. Working across multiple Google accounts on the same machine meant Application Default Credentials silently pointed to the wrong project, causing authentication failures that looked like API errors. Resolving this required switching gcloud auth application-default login to the correct account and verifying every credential source independently.

Time. Building a production-grade, cloud-deployed, multimodal AI agent solo in under a week while debugging infrastructure in parallel is genuinely hard. Every hour spent on a deployment error was an hour not spent on the product.

Accomplishments that we're proud of

Universal app control without any API access. SUVI works on every application — Chrome, VS Code, Notepad, Excel, government websites, anything — without plugins, browser extensions, or DOM access. It uses vision the way a human would. This is the core technical achievement.

A real accessibility product, not a demo. SUVI was originally conceived as a general productivity tool. Midway through the build, the mission shifted: build this for the people who need it most. The three-mode input system — voice for motor-impaired users, screen narration for blind users, text input for non-speaking users — emerged from that shift and makes SUVI a fundamentally different kind of product.

Single Python runtime with no IPC. The entire desktop client in PyQt6 — UI, voice capture, screen capture, and action execution all in one process — is unusual and correct. It makes SUVI faster, simpler to deploy, and easier to debug than a multi-process hybrid would be.

A complete cloud infrastructure in one week. Cloud Run gateway, Vertex AI ADK agent, Firestore memory, Pub/Sub telemetry, BigQuery analytics, Cloud Logging audit trail, Firebase authentication, Secret Manager for credentials — all deployed, connected, and working from a single codebase. Built by one person.

Session Replay. Every task SUVI completes is recorded as an animated GIF and uploaded to Google Cloud Storage. This started as a submission proof mechanism and became a feature — users can share exactly what SUVI did on their behalf, creating transparency and trust.

What we learned

The most important lesson was about who we were building for. SUVI started as a productivity tool. It became an accessibility product the moment we asked the right question: who needs this most? The answer changed everything — the features we prioritized, the way we described the product, and the reason we kept building when things broke.

On the technical side, the Gemini Live API is powerful but requires precise configuration. Model strings, response modalities, tool declarations, and context window management all interact in ways that are not always obvious from the documentation. Understanding how the model accumulates and emits a turn — and building application logic around that structure — was the most nuanced engineering challenge of the project.

We also learned that cloud deployment is its own discipline. Writing code that works locally is one problem. Making it work reliably on Cloud Run, with the right build context, the right import paths, the right environment variables, and the right WebSocket configuration, is a separate and equally demanding problem.

The biggest personal learning: scope ruthlessly. A solo builder in a week cannot build everything. Every feature that did not directly serve the demo or the submission was a distraction. The features that matter are the ones that work when a judge is watching.

What's next for SUVI

Making it genuinely accessible to the people who need it. The current version requires technical setup. The next version needs a one-click installer, a hosted backend that users do not have to manage, and a free tier for disabled users who cannot afford subscription pricing. This is the most important next step.

Mobile companion app. Many disabled users rely on smartphones more than desktops. A mobile version of SUVI — or a companion app that lets users control their desktop from their phone — would dramatically expand who can use it.

Multilingual support. The Gemini Live API supports multiple languages natively. SUVI should work in Hindi, Tamil, Marathi, Arabic, and every other language its users speak. Accessibility cannot be English-only.

Proactive assistance. Right now SUVI waits to be asked. The next version should notice things — an unread message from a doctor, a calendar event approaching, a task left incomplete — and offer help without being asked. The best assistant is one that anticipates.

Integration with assistive technology ecosystems. Screen readers, switch access devices, eye trackers — SUVI should work alongside existing assistive technology, not compete with it. Deep integration with the accessibility APIs on Windows, macOS, and Linux would make SUVI a layer that enhances everything a user already has.

Open sourcing the core. The computer use loop, the voice pipeline, and the accessibility-first interaction model should be available to every developer who wants to build accessible software. SUVI's architecture should become infrastructure.

The dream is simple: every person who needs a computer should be able to use one. SUVI is one step toward that. There is a lot of road left.
