Inspiration
The idea for Mentora came from two frustrations every student knows.
The first is the context-switch problem: you're deep in a paper, a lecture slide, or a problem set, and you hit something you don't understand. The moment you open a new tab to ask ChatGPT, you've already lost your place. You type out your question, wait for a wall of text, and then try to re-immerse yourself in what you were doing. The tool that's supposed to help you learn is actually breaking the flow of learning.
The second came from remote learning. When lectures moved online, a whole generation of students found themselves watching pre-recorded videos at home, with no way to raise their hand. You pause the video, you have a question, but there's no professor to ask and no classmate to turn to. You either push through confused, or you open another tab and break your focus trying to find an answer that may not even be contextualised to what you were just watching.
Mentora was built to solve both. It lives inside the learning environment, not alongside it: a tutor that can see what you're looking at, hear the video you're watching and the question you ask the moment you have it, and respond in spoken audio without requiring you to lift your hands off your work.
What it does
Mentora is a Chrome extension with a real-time AI tutor in the side panel. It captures your screen, listens to your mic (and optionally your tab audio for videos), and streams everything to a backend powered by the Gemini Live API and Google ADK. The response comes back as spoken audio in under a second, the same latency you'd expect from talking to a person in the room.
It works across three real learning scenarios:
- Lecture notes and textbooks — open your slides or a paper, ask Mentora to explain what's on screen, interrupt mid-response with a follow-up, and save key concepts to your notes by voice
- Homework and problem sets — Socratic Autopilot mode detects when you've been idle and nudges you with a targeted question instead of giving you the answer; ask for a derivation and an interactive EquationSolver widget renders live in the panel with expandable LaTeX steps
- YouTube and online lecture videos — enable tab audio capture, let Mentora listen to the video, pause at any point and ask for a summary or clarification — no scrubbing back, no re-watching
Across all three, you can say "save this to my notes" at any point, and Mentora captures the key concept and saves it to your notes automatically.
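Socratic Autopilot's idle detection can be sketched as a small timer that resets on user activity and fires a nudge once a threshold passes. This is an illustrative sketch, not Mentora's actual implementation; the class name, threshold, and injectable clock are all assumptions.

```typescript
// Hypothetical sketch of Socratic Autopilot's idle detection.
// Threshold value and the injectable clock are illustrative.
class IdleDetector {
  private lastActivity: number;

  constructor(
    private thresholdMs: number,
    private now: () => number = Date.now, // injectable clock, useful for testing
  ) {
    this.lastActivity = this.now();
  }

  // Call on any user event (keystroke, scroll, speech).
  recordActivity(): void {
    this.lastActivity = this.now();
  }

  // True once the user has been idle past the threshold: the cue to
  // nudge with a targeted question instead of giving the answer.
  shouldNudge(): boolean {
    return this.now() - this.lastActivity >= this.thresholdMs;
  }
}
```

The injected clock keeps the logic deterministic; in the extension the same check would run on a periodic timer alongside the audio stream.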
How we built it
The architecture is designed around one constraint: real-time bidirectional audio with no perceptible latency.
The Chrome extension uses the Offscreen Document API (MV3) to own a persistent WebSocket connection, handle microphone and tab audio capture, and drive PCM16 audio playback, all without blocking the UI thread. The side panel is a React + TypeScript app that receives transcript and widget events over the service worker message bus.
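Before streaming, microphone samples from the Web Audio API (Float32 in [-1, 1]) have to be converted to 16-bit signed PCM, the format the Gemini Live API expects for audio input. A standard conversion looks like this; it is a generic sketch rather than Mentora's exact capture code:

```typescript
// Convert Float32 samples from the Web Audio API ([-1, 1]) into
// 16-bit signed PCM for streaming to the Gemini Live API.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid overflow
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The same representation is used in reverse for playback: PCM16 chunks from the server are converted back to Float32 before being fed to an AudioContext.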
The backend is a FastAPI server deployed on Google Cloud Run, with session affinity and a 3600-second timeout to keep WebSocket connections alive. It feeds audio and JPEG screen frames into the Google ADK Runner using LiveRequestQueue and StreamingMode.BIDI. The ADK agent connects to Gemini via a native bidirectional stream, audio in, audio out, tool calls in between.
Server-side voice activity detection (AutomaticActivityDetection) means we don't do any silence detection on the client at all. Tool calls happen over the live stream: when you say "save this", the model calls add_to_notes; when a visual explanation makes more sense, it calls render_generative_widget and the widget appears in the panel while the spoken explanation plays simultaneously.
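On the panel side, those events reduce to a small dispatcher over the message bus. The event shape below is a guess at the protocol, not Mentora's real one; only the widget names come from the project description.

```typescript
// Hypothetical shape of the events the side panel receives over the
// service-worker message bus; the real event names may differ.
type PanelEvent =
  | { type: "transcript"; text: string }
  | { type: "widget"; widget: "EquationSolver" | "ProbabilityTable" | "Flashcard"; props: unknown };

interface PanelHandlers {
  onTranscript: (text: string) => void;
  onWidget: (widget: string, props: unknown) => void;
}

// Route one bus message to the matching React handler.
function dispatchPanelEvent(ev: PanelEvent, h: PanelHandlers): void {
  switch (ev.type) {
    case "transcript":
      h.onTranscript(ev.text);
      break;
    case "widget":
      h.onWidget(ev.widget, ev.props);
      break;
  }
}
```

Because widget events arrive on the same stream as transcript events, the panel can render a widget mid-response without waiting for the audio to finish.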
Challenges we ran into
WebSocket persistence in Chrome MV3 - service workers are ephemeral and spin down after 30 seconds of inactivity. We solved this by moving the WebSocket into an Offscreen Document, which has a persistent lifetime, and routing all messages through the service worker as a relay.
Barge-in and audio state management - when the user speaks over Mentora mid-response, we need to immediately stop playback, drain queued audio chunks, and signal the server to discard pending output. Getting the timing right across the WebSocket boundary without cutting off words or letting stale audio play required careful sequencing of flush signals and playback queue clearing.
WebSocket connections on Cloud Run - Cloud Run's load balancer drops connections after 300 seconds by default. We fixed this with --session-affinity to pin a session to one container instance and --timeout=3600 to extend the window to match a realistic study session.
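The fix boils down to two flags on the Cloud Run service; the service name and region below are placeholders:

```shell
# Pin each WebSocket session to one container instance and extend the
# request timeout to an hour. "mentora-backend" is a placeholder name.
gcloud run services update mentora-backend \
  --region=us-central1 \
  --session-affinity \
  --timeout=3600
```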
Generative UI timing - making the widget appear while the spoken summary is playing, not after, required the agent to emit the tool call first and the audio continuation second. This meant carefully structuring the ADK tool definitions so Gemini understood the expected output order.
Accomplishments that we're proud of
- Sub-second voice response latency - achieved through native Gemini audio understanding and generation, not a speech-to-text → LLM → TTS pipeline
- Barge-in that actually works - interrupting Mentora mid-sentence stops playback instantly, with no stale audio leaking through
- Generative UI rendered live - interactive widgets (EquationSolver, ProbabilityTable, Flashcard) appear in the panel alongside the spoken response, not after it
- Tab audio capture - Mentora can listen to a YouTube lecture in real time, so it genuinely follows along with a video with you
- Socratic Autopilot - a tutoring mode with a strict zero-answer policy that nudges you toward the answer with questions, never just giving it to you
- Full Cloud Run deployment - the backend auto-scales from zero with no infrastructure to manage, making the product ready to serve real users
What we learned
Working with the Gemini Live API taught us that native audio understanding is qualitatively different from a speech-to-text → LLM → TTS pipeline. The model picks up on tone, pacing, and hesitation in a way that changes how it responds, which matters for a tutoring context where a confused "um, okay?" should be treated differently from a confident "got it, next question."
The Google ADK abstracted away a lot of the session and event loop complexity, but it required careful thinking about streaming tool-call interleaving: how to emit a widget render call mid-stream without blocking the audio response. That was the most technically interesting problem we solved.
We also learned that the hardest part of building a real-time product isn't the AI; it's the plumbing. Managing audio state across a WebSocket boundary, keeping connections alive in a serverless environment, and handling the edge cases of barge-in and reconnection taught us more about distributed systems than we expected going into a hackathon.
What's next for Mentora AI
The foundation is solid. The near-term roadmap is about depth:
- Spaced repetition - automatically schedule saved notes as flashcard reviews based on forgetting curves
- Session history - persist full transcripts and notes across sessions so nothing is lost between study blocks
- More widget types - graph plotter, code runner, timeline builder triggered by context
- Firefox + Safari - extend beyond Chrome using the WebExtensions API
Longer-term, the more interesting direction is personalisation: building a model of what you specifically know and struggle with, so every explanation is calibrated to your actual knowledge state rather than a generic student profile. Beyond individual use, collaborative tutoring sessions, integrations with LMS platforms like Canvas, and a native mobile experience are all natural extensions of the architecture we've already built.
More info on the Google Cloud deployment and project details is on my GitHub: https://github.com/LlamaWritesCode/mentora-ai
Built With
- chrome-extensions-api
- docker
- fastapi
- gemini-live-api
- google-adk
- google-artifact-registry
- google-cloud
- google-cloud-build
- google-cloud-run
- javascript
- jpeg-frame-capture
- pcm16-audio-streaming
- python
- react
- tailwind
- typescript
- vite
- websocket
