Why We Built This
77% of people have speaking anxiety, but only 8% seek help. Professional speech coaching costs $100–300/hour — putting it out of reach for international students preparing for job interviews, neurodivergent professionals navigating workplace communication, and first-generation students who never had professional speaking modeled for them.
Existing tools like Yoodli and Orai only analyze audio, missing the 70% of communication that's nonverbal — your eyes, your posture, your body language. We wanted to build something that sees the full picture and makes practice fun, not stressful.
What SpeechMAX Does
SpeechMAX is a browser-based AI speech coach that analyzes your speaking across five dimensions in real time:
- Clarity — filler word detection via live transcription
- Confidence — eye contact quality via 468 facial landmarks + iris tracking
- Pacing — words-per-minute consistency and variation
- Expression — pitch variation and vocal energy analysis
- Composure — posture, fidgeting, blink rate, jaw tension, and biometric stress signals
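As one concrete example of how a dimension can be scored, pacing can be derived from transcript word timestamps. The formula and target below are our sketch for illustration, not SpeechMAX's exact metric:

```typescript
// Hedged sketch of a pacing metric: words-per-minute computed from word
// timestamps, scored against a target band. The target and the linear
// penalty are illustrative assumptions.
function pacingScore(wordTimesMs: number[], targetWpm = 140): number {
  if (wordTimesMs.length < 2) return 0;
  const minutes = (wordTimesMs[wordTimesMs.length - 1] - wordTimesMs[0]) / 60000;
  if (minutes <= 0) return 0;
  const wpm = (wordTimesMs.length - 1) / minutes;
  // 100 at the target, dropping one point per WPM of deviation.
  return Math.max(0, 100 - Math.abs(wpm - targetWpm));
}
```

Thirteen words spaced 500 ms apart work out to 120 WPM, scoring about 80 against a 140 WPM target; a real pipeline would also track variation across windows, not just the average.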
You start by choosing your goal (interview, presentation, casual, or reading), then do a 30-second scan that builds your speech profile as an animated radar chart. Based on your weaknesses, SpeechMAX recommends targeted mini-games:
| Game | Trains | How |
|---|---|---|
| Filler Ninja | Clarity | Speak without filler words — they get slashed on detection |
| Eye Lock | Confidence | Maintain eye contact — screen dims and warns when you look away |
| Pace Racer | Pacing | Keep your WPM in the target zone — gear system rewards sustained pace |
| Pitch Surfer | Expression | Vary your pitch to ride the wave — monotone = wipeout |
| Stage Presence | Composure | Master body language — open stance, gestures, commanding presence |
Mike, your AI coach powered by Gemini 2.5 Flash, sees all your scores, game history, and badges — giving short, personalized advice through an in-app chat. Sign in with Google to sync progress across devices, or continue as a guest with zero friction.
How We Built It
All speech and video analysis runs 100% client-side — no audio or video ever leaves the browser:
- MediaPipe FaceLandmarker — 468 facial landmarks + iris tracking + blendshapes for eye contact, blink rate, jaw tension, and lip compression
- MediaPipe PoseLandmarker — 33 body keypoints for posture alignment, gesture quality, fidget detection, and bad habit recognition
- Web Speech API — real-time transcription with confidence filtering and context-aware filler detection across 20+ filler words
- Web Audio API — DynamicsCompressor → AnalyserNode chain for real-time pitch detection via autocorrelation
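The autocorrelation step named above can be sketched in a few lines. In the app the samples would come from an AnalyserNode's getFloatTimeDomainData(); here `detectPitch` and its thresholds are our illustration, not SpeechMAX's exact code:

```typescript
// Minimal sketch of pitch detection via autocorrelation: find the lag at
// which the signal best correlates with itself, and convert it to Hz.
function detectPitch(samples: Float32Array, sampleRate: number): number | null {
  // Skip near-silent frames: autocorrelating noise yields spurious peaks.
  let energy = 0;
  for (let i = 0; i < samples.length; i++) energy += samples[i] * samples[i];
  if (Math.sqrt(energy / samples.length) < 0.01) return null;

  // Search lags covering roughly 80–1000 Hz, the typical speech F0 range.
  const minLag = Math.floor(sampleRate / 1000);
  const maxLag = Math.floor(sampleRate / 80);
  let bestLag = -1;
  let bestCorr = 0;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let corr = 0;
    for (let i = 0; i + lag < samples.length; i++) {
      corr += samples[i] * samples[i + lag];
    }
    if (corr > bestCorr) {
      bestCorr = corr;
      bestLag = lag;
    }
  }
  return bestLag > 0 ? sampleRate / bestLag : null;
}
```

Feeding it a synthetic 220 Hz sine at 44.1 kHz returns roughly 220 Hz; a production version would add parabolic interpolation around the peak and smoothing across frames.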
The backend uses Supabase for authentication (anonymous + Google OAuth), PostgreSQL with row-level security for data persistence, and an Edge Function that proxies Gemini API calls so the API key never touches the client.
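The key-hiding proxy pattern can be sketched as a pure request builder. The endpoint shape and names below are our assumptions, not the deployed Edge Function:

```typescript
// Hypothetical sketch of the proxy behind the Edge Function. The browser
// posts only the chat payload; the secret key is read from the server
// environment and attached on the server-to-Google hop only.
interface GeminiProxyInput {
  model: string;       // e.g. "gemini-2.5-flash"
  contents: unknown[]; // chat turns, passed through untouched
}

interface OutboundCall {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildGeminiCall(input: GeminiProxyInput, apiKey: string): OutboundCall {
  return {
    // The key rides as a query parameter on this hop; it never appears in
    // any response sent back to the browser.
    url: `https://generativelanguage.googleapis.com/v1beta/models/${input.model}:generateContent?key=${apiKey}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ contents: input.contents }),
  };
}
```

Keeping the builder pure makes the function easy to unit-test without hitting the network.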
The frontend is React 18 + TypeScript + Vite with Zustand for state management, Framer Motion for animations, and Tailwind CSS for styling. Every game has difficulty scaling based on your scan scores, goal-driven prompts, and per-game coaching tips.
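The score-driven difficulty scaling could look something like the mapping below; the thresholds and tier names are assumptions for the sketch, not the shipped values:

```typescript
// Illustrative difficulty scaling: map a 0–100 scan score for a dimension
// to a game preset. Thresholds and tier names are assumed for this sketch.
type Tier = "gentle" | "standard" | "intense";

function difficultyFor(scanScore: number): Tier {
  if (scanScore < 40) return "gentle";  // weak area: looser targets, more time
  if (scanScore < 75) return "standard";
  return "intense";                     // strong area: tighter thresholds
}
```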
Total API cost: $0. All ML runs client-side. Privacy by default.
Challenges We Faced
- MediaPipe + Web Speech API running simultaneously causes heavy CPU load — we mitigated with frame skipping (every 2nd frame) and singleton model caching so models aren't re-downloaded on navigation
- Eye tracking race condition — the video element didn't exist when the gaze model tried to attach. We separated stream acquisition from video attachment and waited for the `loadeddata` event
- Web Speech API merges repeated words — saying "um um um" becomes "um" in the transcript. We switched to count-based tracking on interim results instead of position-based detection
- Expired JWT tokens on anonymous auth caused the Gemini proxy to silently return 401 errors. We switched to anon key authentication on the Edge Function to avoid token lifecycle issues
- Filler detection during silence — the streak timer kept ticking when the user stopped talking. We added a 3-second silence threshold that pauses the game with visual indicators
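The CPU mitigation described first can be sketched in a few lines. The names here are ours for illustration; the real code wraps MediaPipe's FaceLandmarker:

```typescript
// Illustrative sketch of frame skipping plus singleton model caching.
interface Landmarker {
  detect(frameIndex: number): string;
}

let cachedModel: Landmarker | null = null;
let modelLoads = 0; // counts expensive loads, for demonstration

function getLandmarker(): Landmarker {
  if (cachedModel) return cachedModel; // singleton: survives route changes
  modelLoads++; // stand-in for the WASM + model download
  cachedModel = { detect: (i) => `landmarks@${i}` };
  return cachedModel;
}

let frameCount = 0;
function processFrame(frameIndex: number): string | null {
  // Frame skipping: run the model on every 2nd frame to halve CPU load.
  if (frameCount++ % 2 !== 0) return null;
  return getLandmarker().detect(frameIndex);
}
```

Because the cache lives at module scope rather than inside a component, navigating between games reuses the loaded model instead of triggering a fresh download.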
What We Learned
- Running multiple real-time ML models in the browser is viable but requires careful resource management — singleton patterns, frame skipping, and parallel initialization made the difference between smooth and janky
- Gamification transforms a stressful activity into something people actually want to do. The gear system in Pace Racer and the streak counter in Filler Ninja create genuine engagement loops
- Anonymous auth with zero-friction onboarding is critical for hackathon demos — judges shouldn't need to create an account to try your product
- Supabase Edge Functions solve the API key exposure problem elegantly — one function, deployed once, and the secret never touches the frontend
What's Next
- Deepgram Nova-3 for server-side transcription — solves the repeated word merging limitation and enables multilingual support
- Session recording and playback — review your practice sessions with timestamped annotations
- Multiplayer practice rooms — practice with friends via WebRTC with real-time feedback
- Custom prompt upload — paste your actual interview questions or presentation script
Built With
- Framer Motion
- Google Gemini
- lucide-react
- MediaPipe
- PostgreSQL
- React
- Supabase
- Tailwind CSS
- TypeScript
- Vercel
- Vite
- Web Audio API
- Web Speech API
- Zustand