Inspiration
There's a five-year-old somewhere right now telling an elaborate bedtime story to a stuffed animal. The fox finds the magic door. The mushrooms glow blue. The owl wears tiny glasses. She will never write this story down — she's five.
Current AI tools don't help her. Text generators require typing. Image generators produce isolated pictures. Neither captures the experience of storytelling — the momentum, the surprise, the feeling that you and the story are building something together. We wanted to close the gap between "I have a story" and "here is my story" — using nothing but voice.
What it does
DreamLoom is a voice-first AI story studio. You talk to Loom — an AI creative director with personality and creative taste — and watch illustrated scenes materialize in real time.
How a session works:
- You speak — describe a character, a world, or a story idea. Loom responds with voice, asks about art style and narrator tone conversationally, then generates the first scene.
- Scenes arrive interleaved — narration paragraphs and original illustrations woven together in a single Gemini API response (`response_modalities=["TEXT","IMAGE"]`). Not two API calls stitched together.
- You interrupt — say "wait, make it nighttime" mid-sentence. Loom pivots instantly. New scene, new mood, new music.
- You show a sketch — hold up a pencil drawing to your webcam. Loom sees it and incorporates the concept into the next scene, rendered in the story's art style.
- Continuity holds — ask "does Mira still have the compass?" three scenes later. Loom answers from its Story Bible — tracked characters, items, and world state across the entire narrative.
- Director's Cut — Loom generates cover art, a logline, and trailer narration. DreamLoom assembles a cinematic trailer (Ken Burns animation, per-scene music, AI voiceover) entirely client-side. Export as Storybook PDF or image ZIP.
Kid-safe mode is ON by default. No violent, scary, or mature content. When a child asks for "blood everywhere," Loom redirects creatively: "How about a shadow that turns out to be friendly?" The youngest user — a five-year-old with a parent — is the design target.
Built for families (bedtime stories), teachers (classroom adventures), and non-writers (people with imaginations who don't type prompts).
How we built it
Two-model architecture:
- Conversation Model — Gemini Live API via Google ADK. `run_live()` handles real-time bidirectional voice with barge-in detection. Audio streams over WebSocket: 16 kHz PCM in, 24 kHz out.
- Scene Model — Gemini interleaved output (`gemini-2.5-flash-image` with `response_modalities=["TEXT","IMAGE"]`) generates illustrated scenes. Called by the agent's `create_scene` tool.
A single Director Agent "Loom" bridges the two through six callable tools: `create_scene`, `generate_music`, `create_directors_cut`, `set_story_metadata`, `add_character`, `get_story_context`.
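Conceptually, the bridge is a name-to-coroutine dispatch: the conversation model emits a tool call, and the agent runs the matching function. A minimal sketch under our own assumptions (the tool bodies below are illustrative stand-ins; the real tools call the scene and music models and push media to the frontend):

```python
import asyncio

# Illustrative stand-ins for two of the six Director Agent tools.
async def create_scene(prompt: str) -> dict:
    return {"status": "generating", "prompt": prompt}

async def get_story_context() -> dict:
    return {"characters": ["Mira"], "items": ["compass"]}

# Registry keyed by function name, the way tool calls arrive by name.
TOOLS = {fn.__name__: fn for fn in (create_scene, get_story_context)}

async def dispatch(name: str, **kwargs) -> dict:
    """Route a tool call emitted by the conversation model to its handler."""
    return await TOOLS[name](**kwargs)

print(asyncio.run(dispatch("create_scene", prompt="a fox at a glowing door")))
```

Keeping the tools as plain async functions means the conversation model never touches image or music APIs directly; it only ever asks for a tool by name.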
Five Gemini models in one session:
| Role | Model |
|---|---|
| Voice conversation | gemini-2.5-flash-native-audio (Live API) |
| Scene generation | gemini-2.5-flash-image (interleaved text+image) |
| Music composition | lyria-realtime-exp (48 kHz stereo) |
| Trailer narration | gemini-2.5-flash-preview-tts |
| Audio transcription (reconnect) | gemini-2.5-flash |
Stack: FastAPI + WebSocket backend on Cloud Run (us-central1). React 19 + Vite + TailwindCSS v4 + Framer Motion frontend. GCS for media. Firestore for session persistence and public gallery. Client-side video assembly via Canvas + Web Audio + MediaRecorder (no FFmpeg).
Challenges we ran into
1. Voice barge-in was unusable out of the box. Default VAD settings triggered false interrupts on background music, natural pauses, and ambient noise. We tuned start_of_speech_sensitivity=LOW, silence_duration_ms=1200, and added a 400ms false-interrupt verification window — if no real speech arrives after an interrupt event, we inject a system message telling Loom to continue where it left off. A cough pauses Loom briefly; it picks right back up.
2. Live API drops connections mid-story. We buffer user and agent audio in rolling 15-second ring buffers. On disconnect: transcribe both buffers with Gemini Flash, re-inject full story state (conversation history + Story Bible + scene summaries), and reconnect transparently. 5 retries with exponential backoff. The story doesn't break.
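A bounded deque is one minimal way to keep only the most recent 15 seconds of audio. This sketch assumes fixed-size PCM chunks; the class and parameter names are ours:

```python
from collections import deque

class AudioRingBuffer:
    """Keeps only the newest `seconds` of PCM, silently dropping the oldest
    chunks, so there is always a bounded window ready to transcribe."""

    def __init__(self, seconds: float = 15.0, sample_rate: int = 16_000,
                 chunk_samples: int = 1_600):
        max_chunks = int(seconds * sample_rate / chunk_samples)
        self._chunks: deque[bytes] = deque(maxlen=max_chunks)

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)  # deque evicts the oldest automatically

    def snapshot(self) -> bytes:
        """Everything currently buffered, oldest first."""
        return b"".join(self._chunks)
```

On disconnect, `snapshot()` from both the user and agent buffers would be handed to Gemini Flash for transcription before reconnecting.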
3. Scene generation blocks the Live API. Interleaved generation takes 10-30 seconds. If the tool blocks, the Live API connection times out (error 1011). All generation runs as background asyncio.create_task(). The tool returns immediately, Loom fills the silence with conversation, and results push to the frontend via a notification queue.
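Returning immediately while the slow work runs in the background takes only a few lines. A toy version, with `asyncio.sleep(0)` standing in for the 10-30 second generation and a queue standing in for the WebSocket push (names are ours):

```python
import asyncio

async def demo() -> tuple[dict, dict]:
    results: asyncio.Queue = asyncio.Queue()   # notification queue to the frontend

    async def generate(prompt: str) -> None:
        await asyncio.sleep(0)                 # stands in for 10-30 s of generation
        await results.put({"scene_for": prompt})

    async def create_scene(prompt: str) -> dict:
        """Tool entry point: schedule the slow work and return at once,
        so the Live API connection never stalls into a 1011 timeout."""
        asyncio.create_task(generate(prompt))
        return {"status": "generating"}        # Loom keeps talking meanwhile

    ack = await create_scene("a fox at a glowing door")
    scene = await results.get()                # the result arrives later
    return ack, scene

ack, scene = asyncio.run(demo())
print(ack, scene)
```

The tool's immediate acknowledgement is what lets the agent fill the silence with conversation while the scene renders.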
4. Music caused self-interruption. The agent's own audio was feeding back into the mic. We implemented three-tier music ducking: mic active → 0%, agent speaking → 10%, idle → 25%, with smooth 200ms volume ramps.
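The three-tier ducking logic reduces to a state-to-gain table plus a short linear ramp. The real code runs in the browser's Web Audio graph; this Python sketch only illustrates the mapping (all names are ours):

```python
# Gain targets for the three session states described above.
DUCK_LEVELS = {"mic_active": 0.0, "agent_speaking": 0.10, "idle": 0.25}
RAMP_MS = 200.0  # ramp duration that avoids audible clicks

def target_gain(state: str) -> float:
    """Music gain the client should ramp toward for a given state."""
    return DUCK_LEVELS[state]

def ramp(current: float, target: float, elapsed_ms: float) -> float:
    """Linear interpolation from the current gain to the target over RAMP_MS."""
    if elapsed_ms >= RAMP_MS:
        return target
    return current + (target - current) * (elapsed_ms / RAMP_MS)
```

In the browser this mapping would drive something like a gain node's scheduled value; the point is that muting the music whenever the mic is hot breaks the feedback loop.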
Accomplishments that we're proud of
- Native interleaved output as a core feature, not a demo — every scene proves `response_modalities=["TEXT","IMAGE"]` via a built-in Debug Panel showing model name, modalities array, part order, and generation time.
- Five Gemini models orchestrated in a single user session — Live API, interleaved output, Lyria RealTime, TTS, and Flash for reconnect transcription. Every model is load-bearing.
- The Director's Cut — cover art, logline, cinematic trailer with AI narration and per-scene music, Storybook PDF, and image ZIP. 720 lines of client-side video assembly. No FFmpeg. No server rendering.
- False-interrupt recovery — the 400ms verification window that distinguishes real barge-in from noise. This single fix made the difference between a brittle demo and a usable product.
- Kid-safe by design — not a filter on top, but a core system prompt directive. Loom redirects with creative alternatives, not refusals.
- Production deployment at https://getdreamloom.com — Cloud Run, GCS, Firestore, automated via a single deploy script.
What we learned
- Voice UX is 90% edge cases. The happy path works in an hour. Making it survive background noise, false barge-ins, connection drops, and the agent's own audio feeding back took weeks.
- Two models are better than one. The Live API can't generate images. The interleaved model can't do real-time voice. Splitting them and bridging with ADK tools gives you the best of both.
- Fill the silence. Scene generation takes 10-30 seconds. If Loom goes quiet, users think it's broken. Teaching the agent to keep talking during generation ("This one's going to be gorgeous — what happens next?") transformed the experience.
- Prove your claims in the product. The Debug Panel isn't a dev tool — it's built for judges and users. Showing `response_modalities: ["TEXT","IMAGE"]` alongside the generated scene removes all doubt about native interleaved output.
What's next for DreamLoom
- Multi-voice narration — different Gemini TTS voices per character in the trailer
- Scene branching — visual tree view for "What If?" alternate story paths
- Collaborative voice — multiple users speaking into the same story session
- ePub export — alongside PDF, export for e-readers with chapter structure
- Prompt replay — record the full voice conversation alongside the generated story so others can see the creative process
Built With
- audioworklet
- canvas-api
- fastapi
- firestore
- framer-motion
- gemini-flash
- gemini-interleaved-output
- gemini-live-api
- gemini-tts
- google-adk
- google-cloud
- google-cloud-run
- jspdf
- jszip
- lyria-realtime
- mediarecorder
- python
- react
- tailwindcss
- typescript
- vite
- web-audio-api
- websocket