Why We Built This

77% of people have speaking anxiety, but only 8% seek help. Professional speech coaching costs $100–300/hour — putting it out of reach for international students preparing for job interviews, neurodivergent professionals navigating workplace communication, and first-generation students who never had professional speaking modeled for them.

Existing tools like Yoodli and Orai only analyze audio, missing the 70% of communication that's nonverbal — your eyes, your posture, your body language. We wanted to build something that sees the full picture and makes practice fun, not stressful.


What SpeechMAX Does

SpeechMAX is a browser-based AI speech coach that analyzes your speaking across five dimensions in real time:

  • Clarity — filler word detection via live transcription
  • Confidence — eye contact quality via 468 facial landmarks + iris tracking
  • Pacing — words-per-minute consistency and variation
  • Expression — pitch variation and vocal energy analysis
  • Composure — posture, fidgeting, blink rate, jaw tension, and biometric stress signals
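As a concrete illustration, the pacing dimension reduces to tracking words per minute over a sliding window of word timestamps. A minimal sketch, with a hypothetical function name and window size (not SpeechMAX's actual scoring code):

```typescript
// Rolling words-per-minute over the most recent window of speech.
// wordTimesSec: timestamp (in seconds) of each recognized word, in order.
function rollingWpm(wordTimesSec: number[], windowSec = 10): number {
  if (wordTimesSec.length === 0) return 0;
  const now = wordTimesSec[wordTimesSec.length - 1];
  // Count words spoken inside the most recent window.
  const inWindow = wordTimesSec.filter((t) => now - t <= windowSec).length;
  return (inWindow / windowSec) * 60; // words per minute
}
```

Consistency can then be scored from the variance of this value across a session: steady speakers stay in a narrow WPM band, while rushed or halting speech swings widely.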

You start by choosing your goal (interview, presentation, casual, or reading), then do a 30-second scan that builds your speech profile as an animated radar chart. Based on your weaknesses, SpeechMAX recommends targeted mini-games:

| Game | Trains | How |
| --- | --- | --- |
| Filler Ninja | Clarity | Speak without filler words — they get slashed on detection |
| Eye Lock | Confidence | Maintain eye contact — screen dims and warns when you look away |
| Pace Racer | Pacing | Keep your WPM in the target zone — gear system rewards sustained pace |
| Pitch Surfer | Expression | Vary your pitch to ride the wave — monotone = wipeout |
| Stage Presence | Composure | Master body language — open stance, gestures, commanding presence |

Mike, your AI coach powered by Gemini 2.5 Flash, sees all your scores, game history, and badges — giving short, personalized advice through an in-app chat. Sign in with Google to sync progress across devices, or continue as a guest with zero friction.


How We Built It

All speech and video analysis runs 100% client-side — no audio or video ever leaves the browser:

  • MediaPipe FaceLandmarker — 468 facial landmarks + iris tracking + blendshapes for eye contact, blink rate, jaw tension, and lip compression
  • MediaPipe PoseLandmarker — 33 body keypoints for posture alignment, gesture quality, fidget detection, and bad habit recognition
  • Web Speech API — real-time transcription with confidence filtering and context-aware filler detection across 20+ filler words
  • Web Audio API — DynamicsCompressor → AnalyserNode chain for real-time pitch detection via autocorrelation
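The autocorrelation step can be sketched as a pure function over the time-domain buffer that an AnalyserNode fills via getFloatTimeDomainData. The function name and frequency bounds below are our illustrative choices, not SpeechMAX's exact code:

```typescript
// Estimate fundamental frequency by finding the lag at which the signal
// best correlates with a shifted copy of itself.
function detectPitchHz(buf: Float32Array, sampleRate: number): number {
  const minHz = 60;  // below typical speech pitch
  const maxHz = 500; // above typical speech pitch
  const minLag = Math.floor(sampleRate / maxHz);
  const maxLag = Math.floor(sampleRate / minHz);
  let bestLag = -1;
  let bestCorr = 0;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let corr = 0;
    for (let i = 0; i + lag < buf.length; i++) {
      corr += buf[i] * buf[i + lag];
    }
    if (corr > bestCorr) {
      bestCorr = corr;
      bestLag = lag;
    }
  }
  return bestLag > 0 ? sampleRate / bestLag : 0; // 0 = no pitch found
}
```

Running this on each animation frame yields a pitch contour; its variance over time is one way to drive an expression score like the one Pitch Surfer uses.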

The backend uses Supabase for authentication (anonymous + Google OAuth), PostgreSQL with row-level security for data persistence, and an Edge Function that proxies Gemini API calls so the API key never touches the client.
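A minimal sketch of that proxy pattern follows. The real function runs as a Deno-based Supabase Edge Function; the names, types, and injected fetch here are our illustrative simplifications:

```typescript
// Minimal structural type so the sketch doesn't depend on DOM typings.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string }
) => Promise<{ status: number; text(): Promise<string> }>;

const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent";

// Relay the client's payload to Gemini. The API key is read server-side
// and never appears in anything sent back to the browser.
async function proxyGemini(
  clientPayload: string,
  apiKey: string,
  fetchFn: FetchLike
): Promise<{ status: number; body: string }> {
  const upstream = await fetchFn(`${GEMINI_URL}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: clientPayload,
  });
  return { status: upstream.status, body: await upstream.text() };
}
```

The design point is that the secret lives only in the function's environment; the client calls the Edge Function's URL and never sees the key.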

The frontend is React 18 + TypeScript + Vite with Zustand for state management, Framer Motion for animations, and Tailwind CSS for styling. Every game has difficulty scaling based on your scan scores, goal-driven prompts, and per-game coaching tips.

Total API cost: $0. All ML runs client-side. Privacy by default.


Challenges We Faced

  • MediaPipe + Web Speech API running simultaneously causes heavy CPU load — we mitigated this with frame-skipping (processing every second frame) and singleton model caching, so models aren't re-downloaded on navigation

  • Eye tracking race condition — the video element didn't exist when the gaze model tried to attach. We separated stream acquisition from video attachment and waited for the loadeddata event

  • Web Speech API merges repeated words — saying "um um um" becomes "um" in the transcript. We switched to count-based tracking on interim results instead of position-based detection

  • Expired JWT tokens on anonymous auth caused the Gemini proxy to silently return 401 errors. We switched to anon key authentication on the edge function to avoid token lifecycle issues

  • Filler detection during silence — the streak timer kept ticking when the user stopped talking. We added a 3-second silence threshold that pauses the game with visual indicators
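The count-based fix for the repeated-word merging can be sketched like this, with a hypothetical filler list and function names of our choosing: count filler occurrences in each interim transcript and keep the running maximum, so a merged or shrinking interim never lowers the tally.

```typescript
// Illustrative subset of the 20+ filler words the real detector tracks.
const FILLERS = ["um", "uh", "like", "you know", "basically"];

// Count filler occurrences in a transcript, using word boundaries so
// "likely" doesn't count as "like".
function countFillers(transcript: string): number {
  const text = transcript.toLowerCase();
  return FILLERS.reduce((sum, filler) => {
    const re = new RegExp(`\\b${filler}\\b`, "g");
    return sum + (text.match(re) ?? []).length;
  }, 0);
}

// Track the running maximum across interim results, since interim
// transcripts can shrink or merge repeated words between updates.
function makeFillerTracker(): (interim: string) => number {
  let maxSeen = 0;
  return (interim) => {
    maxSeen = Math.max(maxSeen, countFillers(interim));
    return maxSeen;
  };
}
```

Position-based diffing breaks when the recognizer rewrites earlier words; tracking only the count's high-water mark sidesteps that entirely.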


What We Learned

  • Running multiple real-time ML models in the browser is viable but requires careful resource management — singleton patterns, frame skipping, and parallel initialization made the difference between smooth and janky

  • Gamification transforms a stressful activity into something people actually want to do. The gear system in Pace Racer and the streak counter in Filler Ninja create genuine engagement loops

  • Anonymous auth with zero-friction onboarding is critical for hackathon demos — judges shouldn't need to create an account to try your product

  • Supabase Edge Functions solve the API key exposure problem elegantly — one function, deployed once, and the secret never touches the frontend
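The singleton caching called out above reduces to a memoized async loader. A generic sketch (the MediaPipe call in the usage comment is an assumption about how it would plug in, not the actual implementation):

```typescript
// The first call kicks off the expensive load; every later call,
// including concurrent ones, reuses the same promise, so models are
// never re-downloaded on navigation.
function singletonLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => (pending ??= load());
}

// Usage sketch, with FaceLandmarker as the expensive resource:
// const getFaceModel = singletonLoader(() =>
//   FaceLandmarker.createFromOptions(fileset, options));
```

Because the promise (not the resolved value) is cached, two components that request the model during the same navigation share one in-flight download instead of racing to start two.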


What's Next

  • Deepgram Nova-3 for server-side transcription — solves the repeated word merging limitation and enables multilingual support
  • Session recording and playback — review your practice sessions with timestamped annotations
  • Multiplayer practice rooms — practice with friends via WebRTC with real-time feedback
  • Custom prompt upload — paste your actual interview questions or presentation script

Built With

  • Framer Motion
  • Google Gemini
  • lucide-react
  • MediaPipe
  • PostgreSQL
  • React
  • Supabase
  • Tailwind CSS
  • TypeScript
  • Vercel
  • Vite
  • Web Audio API
  • Web Speech API
  • Zustand