Inspiration
There's a five-year-old somewhere right now telling an elaborate bedtime story to a stuffed animal. The fox finds the magic door. The mushrooms glow blue. The owl wears tiny glasses. She will never write this story down — she's five.
Current AI tools don't help her. Text generators require typing. Image generators produce isolated pictures. Neither captures the experience of storytelling — the momentum, the surprise, the feeling that you and the story are building something together. We wanted to close the gap between "I have a story" and "here is my story" — using nothing but voice.
What it does
DreamLoom is a voice-first AI story studio. You talk to Loom — an AI creative director with personality and creative taste — and watch illustrated scenes materialize in real time.
How a session works:
- You speak — describe a character, a world, or a story idea. Loom responds with voice, asks about art style and narrator tone conversationally, then generates the first scene.
- Scenes arrive interleaved — narration paragraphs and original illustrations woven together in a single Gemini API response (`response_modalities=["TEXT","IMAGE"]`). Not two API calls stitched together.
- You interrupt — say "wait, make it nighttime" mid-sentence. Loom pivots instantly. New scene, new mood, new music.
- You show a sketch — hold up a pencil drawing to your webcam. Loom sees it and incorporates the concept into the next scene, rendered in the story's art style.
- Continuity holds — ask "does Mira still have the compass?" three scenes later. Loom answers from its Story Bible — tracked characters, items, and world state across the entire narrative.
- Director's Cut — Loom generates cover art, a logline, and trailer narration. DreamLoom assembles a cinematic trailer (Ken Burns animation, per-scene music, AI voiceover) entirely client-side. Export as Storybook PDF or image ZIP.
Kid-safe mode is ON by default. No violent, scary, or mature content. When a child asks for "blood everywhere," Loom redirects creatively: "How about a shadow that turns out to be friendly?" The youngest user — a five-year-old with a parent — is the design target.
Built for families (bedtime stories), teachers (classroom adventures), and non-writers (people with imaginations who don't type prompts).
How we built it
Two-model architecture:
- Conversation Model — Gemini Live API via Google ADK. `run_live()` handles real-time bidirectional voice with barge-in detection. Audio streams over WebSocket: 16 kHz PCM in, 24 kHz out.
- Scene Model — Gemini interleaved output (`gemini-2.5-flash-image` with `response_modalities=["TEXT","IMAGE"]`) generates illustrated scenes. Called by the agent's `create_scene` tool.
A single Director Agent "Loom" bridges the two through six callable tools: `create_scene`, `generate_music`, `create_directors_cut`, `set_story_metadata`, `add_character`, `get_story_context`.
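Conceptually, the bridge is a name-to-coroutine dispatch: the conversation model emits a tool call, and the agent runs the matching function. A minimal sketch under our own assumptions (the tool bodies below are illustrative stand-ins; the real tools call the scene and music models and push media to the frontend):

```python
import asyncio

# Illustrative stand-ins for two of the six Director Agent tools.
async def create_scene(prompt: str) -> dict:
    return {"status": "generating", "prompt": prompt}

async def get_story_context() -> dict:
    return {"characters": ["Mira"], "items": ["compass"]}

# Registry keyed by function name, the way tool calls arrive by name.
TOOLS = {fn.__name__: fn for fn in (create_scene, get_story_context)}

async def dispatch(name: str, **kwargs) -> dict:
    """Route a tool call emitted by the conversation model to its handler."""
    return await TOOLS[name](**kwargs)

print(asyncio.run(dispatch("create_scene", prompt="a fox at a glowing door")))
```

Keeping the tools as plain async functions means the conversation model never touches image or music APIs directly; it only ever asks for a tool by name.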
Five Gemini models in one session:
| Role | Model |
|---|---|
| Voice conversation | gemini-2.5-flash-native-audio (Live API) |
| Scene generation | gemini-2.5-flash-image (interleaved text+image) |
| Music composition | lyria-realtime-exp (48 kHz stereo) |
| Trailer narration | gemini-2.5-flash-preview-tts |
| Audio transcription (reconnect) | gemini-2.5-flash |
Stack: FastAPI + WebSocket backend on Cloud Run (us-central1). React 19 + Vite + TailwindCSS v4 + Framer Motion frontend. GCS for media. Firestore for session persistence and public gallery. Client-side video assembly via Canvas + Web Audio + MediaRecorder (no FFmpeg).
Challenges we ran into
1. Voice barge-in was unusable out of the box. Default VAD settings triggered false interrupts on background music, natural pauses, and ambient noise. We tuned start_of_speech_sensitivity=LOW, silence_duration_ms=1200, and added a 400ms false-interrupt verification window — if no real speech arrives after an interrupt event, we inject a system message telling Loom to continue where it left off. A cough pauses Loom briefly; it picks right back up.
2. Live API drops connections mid-story. We buffer user and agent audio in rolling 15-second ring buffers. On disconnect: transcribe both buffers with Gemini Flash, re-inject full story state (conversation history + Story Bible + scene summaries), and reconnect transparently. 5 retries with exponential backoff. The story doesn't break.
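A bounded deque is one minimal way to keep only the most recent 15 seconds of audio. This sketch assumes fixed-size PCM chunks; the class and parameter names are ours:

```python
from collections import deque

class AudioRingBuffer:
    """Keeps only the newest `seconds` of PCM, silently dropping the oldest
    chunks, so there is always a bounded window ready to transcribe."""

    def __init__(self, seconds: float = 15.0, sample_rate: int = 16_000,
                 chunk_samples: int = 1_600):
        max_chunks = int(seconds * sample_rate / chunk_samples)
        self._chunks: deque[bytes] = deque(maxlen=max_chunks)

    def append(self, chunk: bytes) -> None:
        self._chunks.append(chunk)  # deque evicts the oldest automatically

    def snapshot(self) -> bytes:
        """Everything currently buffered, oldest first."""
        return b"".join(self._chunks)
```

On disconnect, `snapshot()` from both the user and agent buffers would be handed to Gemini Flash for transcription before reconnecting.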
3. Scene generation blocks the Live API. Interleaved generation takes 10-30 seconds. If the tool blocks, the Live API connection times out (error 1011). All generation runs as background asyncio.create_task(). The tool returns immediately, Loom fills the silence with conversation, and results push to the frontend via a notification queue.
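Returning immediately while the slow work runs in the background takes only a few lines. A toy version, with `asyncio.sleep(0)` standing in for the 10-30 second generation and a queue standing in for the WebSocket push (names are ours):

```python
import asyncio

async def demo() -> tuple[dict, dict]:
    results: asyncio.Queue = asyncio.Queue()   # notification queue to the frontend

    async def generate(prompt: str) -> None:
        await asyncio.sleep(0)                 # stands in for 10-30 s of generation
        await results.put({"scene_for": prompt})

    async def create_scene(prompt: str) -> dict:
        """Tool entry point: schedule the slow work and return at once,
        so the Live API connection never stalls into a 1011 timeout."""
        asyncio.create_task(generate(prompt))
        return {"status": "generating"}        # Loom keeps talking meanwhile

    ack = await create_scene("a fox at a glowing door")
    scene = await results.get()                # the result arrives later
    return ack, scene

ack, scene = asyncio.run(demo())
print(ack, scene)
```

The tool's immediate acknowledgement is what lets the agent fill the silence with conversation while the scene renders.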
4. Music caused self-interruption. The agent's own audio was feeding back into the mic. We implemented three-tier music ducking: mic active → 0%, agent speaking → 10%, idle → 25%, with smooth 200ms volume ramps.
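The three-tier ducking logic reduces to a state-to-gain table plus a short linear ramp. The real code runs in the browser's Web Audio graph; this Python sketch only illustrates the mapping (all names are ours):

```python
# Gain targets for the three session states described above.
DUCK_LEVELS = {"mic_active": 0.0, "agent_speaking": 0.10, "idle": 0.25}
RAMP_MS = 200.0  # ramp duration that avoids audible clicks

def target_gain(state: str) -> float:
    """Music gain the client should ramp toward for a given state."""
    return DUCK_LEVELS[state]

def ramp(current: float, target: float, elapsed_ms: float) -> float:
    """Linear interpolation from the current gain to the target over RAMP_MS."""
    if elapsed_ms >= RAMP_MS:
        return target
    return current + (target - current) * (elapsed_ms / RAMP_MS)
```

In the browser this mapping would drive something like a gain node's scheduled value; the point is that muting the music whenever the mic is hot breaks the feedback loop.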
Accomplishments that we're proud of
- Native interleaved output as a core feature, not a demo — every scene proves `response_modalities=["TEXT","IMAGE"]` via a built-in Debug Panel showing model name, modalities array, part order, and generation time.
- Five Gemini models orchestrated in a single user session — Live API, interleaved output, Lyria RealTime, TTS, and Flash for reconnect transcription. Every model is load-bearing.
- The Director's Cut — cover art, logline, cinematic trailer with AI narration and per-scene music, Storybook PDF, and image ZIP. 720 lines of client-side video assembly. No FFmpeg. No server rendering.
- False-interrupt recovery — the 400ms verification window that distinguishes real barge-in from noise. This single fix made the difference between a brittle demo and a usable product.
- Kid-safe by design — not a filter on top, but a core system prompt directive. Loom redirects with creative alternatives, not refusals.
- Production deployment at https://getdreamloom.com — Cloud Run, GCS, Firestore, automated via a single deploy script.
What we learned
- Voice UX is 90% edge cases. The happy path works in an hour. Making it survive background noise, false barge-ins, connection drops, and the agent's own audio feeding back took weeks.
- Two models are better than one. The Live API can't generate images. The interleaved model can't do real-time voice. Splitting them and bridging with ADK tools gives you the best of both.
- Fill the silence. Scene generation takes 10-30 seconds. If Loom goes quiet, users think it's broken. Teaching the agent to keep talking during generation ("This one's going to be gorgeous — what happens next?") transformed the experience.
- Prove your claims in the product. The Debug Panel isn't a dev tool — it's built for judges and users. Showing `response_modalities: ["TEXT","IMAGE"]` alongside the generated scene removes all doubt about native interleaved output.
What's next for DreamLoom
- Multi-voice narration — different Gemini TTS voices per character in the trailer
- Scene branching — visual tree view for "What If?" alternate story paths
- Collaborative voice — multiple users speaking into the same story session
- ePub export — alongside PDF, export for e-readers with chapter structure
- Prompt replay — record the full voice conversation alongside the generated story so others can see the creative process
Built With
- audioworklet
- canvas-api
- fastapi
- firestore
- framer-motion
- gemini-flash
- gemini-interleaved-output
- gemini-live-api
- gemini-tts
- google-adk
- google-cloud
- google-cloud-run
- jspdf
- jszip
- lyria-realtime
- mediarecorder
- python
- react
- tailwindcss
- typescript
- vite
- web-audio-api
- websocket