wok.ai


Inspiration

Singapore's hawker culture is dying. UNESCO recognized it as intangible cultural heritage in 2020, but the reality on the ground tells a different story.

When Ah Gong closes his char kway teow stall, we lose the way he flicks his wrist to toss the noodles. The sound of his wok. How he knows the prawns are done by smell alone. The secret of wok hei that he's never written down because "you just feel it, lah." The moment he retires, that knowledge evaporates.

We built wok.ai because we refuse to let our favourite comfort food become a memory. We want every ah ma's bak chor mee, every uncle's laksa, every family's signature dish to survive.

What it does

wok.ai transforms cooking videos into interactive, voice-guided cooking experiences. Upload any cooking video (or record a chef narrating a recipe), and our AI extracts not just the ingredients and steps, but the sensory cues, timing hints, and chef's pointers that make the difference between a good dish and a great one.

When you're ready to cook, activate the voice assistant and go hands-free. Ask "what's next?", "how much garlic again?", or "set a timer for 5 minutes" without touching your device. wok.ai summarises each step, highlighting the main action, key ingredients, and timing, and shares the chef's tips in a conversational way.

Key features:

- Video-to-recipe extraction using Gemini 2.5 Flash for multimodal analysis
- Voice recording for capturing family recipes with all their nuances
- Intelligent step structure separating brief instructions from detailed chef's pointers
- Hands-free voice assistant powered by ElevenLabs for natural conversation
- Voice commands for navigation (next step, go back, repeat) and timers
- Multi-language support for English, Chinese, and Malay recipes

How we built it

Frontend: Next.js 16 with App Router, React 19, and Tailwind CSS 4 for a responsive, modern UI with shadcn/ui components

Dual-Mode Video Analysis:

Gemini AI Mode: Google Gemini 2.5 Flash processes entire videos to extract transcripts, visual descriptions, ingredients, structured steps, and key moments with timestamps
CV Pipeline Mode: Client-side computer vision pipeline combining:
- Scene detection algorithm for keyframe extraction
- Google Cloud Vision API for object/ingredient identification
- MediaPipe Hands for cooking action recognition (chopping, stirring, mixing, etc.)
- OpenAI Whisper for audio transcription
- Intelligent assembly algorithm that merges all data sources into structured recipes
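To illustrate the scene detection step, here is a minimal sketch of keyframe selection by frame differencing. It is one common approach, not necessarily the algorithm wok.ai uses. The function names and the 0.15 threshold are assumptions; the frames are assumed to be RGBA pixel buffers like the `data` field of an `ImageData` read from a canvas.

```typescript
/** Mean absolute RGB difference between two same-sized frames, scaled to 0..1. */
function frameDifference(a: Uint8ClampedArray, b: Uint8ClampedArray): number {
  if (a.length === 0 || a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("frames must be non-empty RGBA buffers of equal size");
  }
  let total = 0;
  for (let i = 0; i < a.length; i += 4) {
    // Compare the R, G and B channels; the alpha channel at i + 3 is ignored.
    total +=
      Math.abs(a[i] - b[i]) +
      Math.abs(a[i + 1] - b[i + 1]) +
      Math.abs(a[i + 2] - b[i + 2]);
  }
  const channelSamples = (a.length / 4) * 3;
  return total / (channelSamples * 255);
}

/**
 * Returns the indices of frames that differ enough from the last keyframe.
 * The first frame is always a keyframe. The threshold is a hypothetical default.
 */
function selectKeyframes(frames: Uint8ClampedArray[], threshold = 0.15): number[] {
  if (frames.length === 0) return [];
  const keyframes: number[] = [0];
  let reference = frames[0];
  for (let i = 1; i < frames.length; i++) {
    if (frameDifference(reference, frames[i]) > threshold) {
      keyframes.push(i);
      reference = frames[i]; // later frames are compared against the new scene
    }
  }
  return keyframes;
}
```

Comparing against the last keyframe rather than the previous frame means a slow pan still triggers a new keyframe eventually, at the cost of occasionally splitting one long shot.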

Voice Recording: ElevenLabs Speech-to-Text API for real-time transcription of voice narration while cooking

Voice Assistant: ElevenLabs Conversational AI with @elevenlabs/react SDK provides natural, ultra-low latency (~100-200ms) voice interaction via WebRTC, with custom client tools (nextStep, previousStep, repeatStep, setTimer) for seamless recipe navigation and timer management
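As a rough sketch of what the four client tools could look like, the factory below builds them as plain functions over a recipe session. The tool names come from the description above; everything else (the `RecipeSession` shape, the `startTimer` callback, the `minutes`/`label` parameters, and the spoken replies) is hypothetical. The sketch deliberately avoids importing the SDK. To our understanding, an object like this is what gets registered as client tools with the ElevenLabs SDK, but the exact wiring should be checked against its documentation.

```typescript
interface RecipeSession {
  steps: string[]; // assumed non-empty
  index: number;   // zero-based position of the current step
}

// Hypothetical callback; in the app this would hand off to the timer system.
type StartTimer = (seconds: number, label: string) => void;

function createClientTools(session: RecipeSession, startTimer: StartTimer) {
  const describe = (): string =>
    `Step ${session.index + 1} of ${session.steps.length}: ${session.steps[session.index]}`;

  return {
    nextStep: (): string => {
      if (session.index >= session.steps.length - 1) return "That was the last step.";
      session.index += 1;
      return describe();
    },
    previousStep: (): string => {
      if (session.index === 0) return "You are already on the first step.";
      session.index -= 1;
      return describe();
    },
    repeatStep: (): string => describe(),
    // The parameter shape is an assumption and must match the agent's tool definition.
    setTimer: (params: { minutes: number; label?: string }): string => {
      if (!Number.isFinite(params.minutes) || params.minutes <= 0) {
        return "I need a timer length greater than zero.";
      }
      const label = params.label ?? "cooking";
      startTimer(Math.round(params.minutes * 60), label);
      return `Timer "${label}" set for ${params.minutes} minutes.`;
    },
  };
}
```

Each tool returns a short string so the agent has something to say back to the cook after the action runs.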

Recipe Structuring: A two-pass AI system using Gemini 2.5 Flash:

First pass extracts raw content from video or voice transcription
Second pass restructures into our instruction + description format that separates the "what to do" from the "how to do it well," including tips, warnings, and sensory cues

Multi-Language Support: Video analysis supports 6 languages (English, Chinese, Malay, Tamil, Indonesian, Thai) with language-specific prompting

Timer System: Custom TimerManager class handles multiple simultaneous timers with pause/resume functionality, visual countdowns, and audio/notification alerts
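A stripped-down sketch of how a timer manager with pause/resume might track several timers at once. Only the class name `TimerManager` comes from the description above; the method names, the ID format, and the injectable clock are assumptions, and the visual countdowns and alerts are left out.

```typescript
interface TimerEntry {
  label: string;
  remainingMs: number;      // time left as of the last start, pause or resume
  startedAt: number | null; // clock reading when last started; null while paused
}

class TimerManager {
  private timers = new Map<string, TimerEntry>();
  private nextId = 1;
  private now: () => number;

  // The clock is injectable so the logic can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  start(label: string, durationMs: number): string {
    const id = `timer-${this.nextId++}`;
    this.timers.set(id, { label, remainingMs: durationMs, startedAt: this.now() });
    return id;
  }

  remaining(id: string): number {
    const t = this.get(id);
    if (t.startedAt === null) return t.remainingMs;
    return Math.max(0, t.remainingMs - (this.now() - t.startedAt));
  }

  pause(id: string): void {
    const t = this.get(id);
    t.remainingMs = this.remaining(id); // freeze the time left
    t.startedAt = null;
  }

  resume(id: string): void {
    const t = this.get(id);
    if (t.startedAt === null) t.startedAt = this.now();
  }

  /** Labels of all timers that have reached zero. */
  finished(): string[] {
    return Array.from(this.timers.entries())
      .filter(([id]) => this.remaining(id) === 0)
      .map(([, t]) => t.label);
  }

  private get(id: string): TimerEntry {
    const t = this.timers.get(id);
    if (!t) throw new Error(`unknown timer: ${id}`);
    return t;
  }
}
```

Storing the remaining time plus a start timestamp, instead of decrementing a counter on an interval, keeps the countdown accurate even if the browser throttles background tabs.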

Database: Supabase (PostgreSQL) with JSONB columns for flexible recipe storage, Row-Level Security (RLS) for user data protection, and optional authentication

State Management: React hooks with careful memoization and useCallback to prevent voice assistant reconnection loops, plus context providers for authentication state

Video Processing:

Client-side: Canvas API for frame extraction and analysis
Server-side: FFmpeg for video manipulation when needed
Efficient blob handling for large video files

Architecture Highlights:

Server-side API routes keep sensitive API keys secure (no client exposure)
WebRTC for the voice assistant needed 46% less code and performed better than a custom WebSocket implementation
Modular CV pipeline allows enabling/disabling features based on available API keys
Progressive enhancement: app works without authentication, Vision API, or Whisper
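One way the key-based feature toggling could work is a small resolver that checks which keys are configured. This is a sketch only: the environment variable names and the `PipelineFeatures` shape are hypothetical, and it assumes that scene detection and hand tracking need no API key because they run locally.

```typescript
interface PipelineFeatures {
  sceneDetection: boolean;
  handTracking: boolean;
  visionLabels: boolean;
  whisperTranscription: boolean;
}

type Env = Record<string, string | undefined>;

// A key counts as present only if it is set and not blank.
const hasKey = (env: Env, name: string): boolean => (env[name] ?? "").trim().length > 0;

function resolvePipelineFeatures(env: Env): PipelineFeatures {
  return {
    sceneDetection: true, // assumed to need no key
    handTracking: true,   // assumed to need no key
    visionLabels: hasKey(env, "GOOGLE_CLOUD_VISION_API_KEY"), // hypothetical variable name
    whisperTranscription: hasKey(env, "OPENAI_API_KEY"),      // hypothetical variable name
  };
}
```

On the server this would be called with `process.env`, and the pipeline would skip any stage whose flag is false.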

Challenges we ran into

Voice assistant stability: The ElevenLabs useConversation hook was causing rapid connect/disconnect cycles. We traced this to callback functions being recreated on every render, which triggered the hook to restart. The fix required careful use of useCallback, useRef, and useMemo to stabilize all dependencies.
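The core idea behind the fix can be shown without React: hand the hook one function whose identity never changes, and let that function forward to whatever the latest handler is. The sketch below is a framework-free illustration with a hypothetical helper name. In a component, `latest` would live in a `useRef` and `stable` would be wrapped in `useCallback` with an empty dependency list.

```typescript
function createStableCallback<A extends unknown[], R>(initial: (...args: A) => R) {
  let latest = initial; // plays the role of ref.current

  // This function is created once, so anything holding it never sees a "new" callback.
  const stable = (...args: A): R => latest(...args);

  // Called on every render in the React version to point at the newest handler.
  const update = (next: (...args: A) => R): void => {
    latest = next;
  };

  return { stable, update };
}
```

Because the consumer only ever sees `stable`, swapping the handler no longer looks like a dependency change, so nothing triggers a reconnect.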

Backward compatibility: Midway through development, we restructured how steps are stored — from simple strings to objects with instruction and description fields. We had to build helper functions throughout the codebase to handle both formats gracefully, ensuring existing recipes still work.
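A helper of the kind described might look like the sketch below. The `instruction` and `description` field names come from the text above; the type names, the function name, and the choice to give legacy string steps an empty description are assumptions.

```typescript
interface StructuredStep {
  instruction: string; // the brief "what to do"
  description: string; // the "how to do it well": tips, warnings, sensory cues
}

// Older recipes stored each step as a plain string.
type StoredStep = string | StructuredStep;

function normalizeStep(step: StoredStep): StructuredStep {
  if (typeof step === "string") {
    // Legacy format: the whole string becomes the instruction.
    return { instruction: step.trim(), description: "" };
  }
  return { instruction: step.instruction, description: step.description ?? "" };
}

function normalizeSteps(steps: StoredStep[]): StructuredStep[] {
  return steps.map(normalizeStep);
}
```

Normalizing once at the point where recipes are loaded means the rest of the UI only ever has to handle the object format.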

Prompt engineering for structured output: Getting Gemini to consistently return valid JSON with our exact schema required extensive prompt iteration. We learned to provide concrete examples, explicit formatting guidelines, and handle cases where the AI wraps responses in markdown code blocks.
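Handling the markdown-wrapped case can be done with a small parser like the one sketched here. The function name is hypothetical, and it only covers the common case where the whole response is a single fenced block. The fence string is built at runtime purely so the example can sit inside a fenced block itself.

```typescript
// Three backticks, assembled at runtime to keep this example's own fence intact.
const FENCE = "`".repeat(3);

// Matches a response that is entirely one fenced block, with an optional "json" tag.
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$",
  "i"
);

/** Parses model output as JSON, unwrapping a surrounding markdown fence if present. */
function extractJson(raw: string): unknown {
  const trimmed = raw.trim();
  const match = trimmed.match(FENCED_BLOCK);
  const payload = match?.[1] ?? trimmed;
  // JSON.parse throws a SyntaxError on invalid input; callers can catch it and retry.
  return JSON.parse(payload);
}
```

The result is typed as `unknown` on purpose, so the caller still has to validate it against the recipe schema before trusting it.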

Accomplishments that we're proud of

The chef's wisdom preservation: Our step structure captures not just "add garlic" but "add garlic and cook for 30 seconds until fragrant — watch it carefully so it doesn't burn!" This is what makes recipes actually reproducible.

Truly hands-free cooking: The voice assistant understands context, navigates steps, sets timers, and answers ingredient questions — all without touching your device.

Video analysis quality: Gemini 2.5 Flash does an impressive job extracting both spoken instructions and visual cues, combining them into rich, actionable steps.

What we learnt

Not to take for granted the knowledge our elders hold, and to preserve it while we can.

What's next for wok.ai

- Recipe scaling: say "I'm cooking for 6 people instead of 4" and all ingredient quantities adjust automatically
- Ingredient substitution suggestions: say "I don't have fish sauce" and get alternatives based on what the chef might recommend
- Technique library: a searchable database of cooking techniques extracted across all recipes

Built With

cv
elevenlabs
gemini
nextjs
supabase
tailwind

Updates

Wee Joe Tan started this project — Jan 17, 2026 06:10 PM EST
