Sonic Buddy: A Real-Time Voice AR Companion for Children
Inspiration
I currently have a 9-month-old son. While he is still too young to operate a tablet on his own, I often see slightly older children around town holding tablets, silently staring at YouTube videos. While digital devices are convenient for parents, I felt a sense of unease seeing children engaged in purely "passive experiences."
"When my son touches a tablet in a few years, I want him to have an 'active experience' where his words and imagination move the world, rather than just consuming a passive video."
For young children who cannot use a keyboard or mouse, their "voice" is their greatest interface. If the words they spoke were instantly reflected in the space before them like magic, imagine how their eyes would light up.
That hope started this project. We designed it to require no dedicated app installation, so that when my son starts speaking, he can instantly experience magic just by opening a web browser on a familiar smartphone or tablet.
What it does
A five-year-old stands in front of the camera. "Give me cat ears!" she says. Half a second later, Nova replies "Here you go, cat ears!" while fluffy ears appear on her head in the AR view. She laughs. "Now sparkles!" Both hands start glowing. "Fireworks too!" The whole screen erupts.
She never touched a keyboard or a mouse.
This is Sonic Buddy — a real-time AR companion for children in the early elementary age range, built around Nova 2 Sonic's bidirectional voice streaming and native tool use.
- Over 30 visual effects respond to voice commands.
- Beyond voice, intuitive AR interactions let children pinch and drag AR items on the screen, or even "feed" them to their mouth to make them disappear.
- Works instantly in any mobile or tablet browser — no app install required. Zero camera frames leave the browser.
Child: "Give me cat ears!"
↓ audio stream (WebSocket)
Nova 2 Sonic: understands context, selects tool
↓ tool event: apply_effect({ effectId: "ITEM_CAT_EARS" })
Browser: renders cat ears at face-tracked position
↓ audio response stream
Nova: "Here you go, cat ears!" (spoken reply)
This cycle completes in ~0.5–1 second.
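The browser side of this loop can be sketched as a small dispatcher that turns a tool event into a render command. The event shape, effect IDs, and anchoring rules below are illustrative assumptions, not the actual Sonic Buddy schema:

```typescript
// Minimal sketch of a browser-side tool-event dispatcher.
// Event shapes and effect IDs are illustrative, not the real schema.

type ToolEvent =
  | { tool: "apply_effect"; input: { effectId: string } }
  | { tool: "set_theme"; input: { themeId: string } };

type RenderCommand =
  | { kind: "effect"; effectId: string; anchor: "face" | "hands" | "screen" }
  | { kind: "theme"; themeId: string };

// Which effects track the face vs. hands vs. whole screen (illustrative subset).
const FACE_ANCHORED = new Set(["ITEM_CAT_EARS", "ITEM_CROWN"]);
const HAND_ANCHORED = new Set(["SPARKLE_HANDS"]);

function dispatchToolEvent(ev: ToolEvent): RenderCommand {
  switch (ev.tool) {
    case "apply_effect": {
      const id = ev.input.effectId;
      const anchor = FACE_ANCHORED.has(id) ? "face"
        : HAND_ANCHORED.has(id) ? "hands"
        : "screen";
      return { kind: "effect", effectId: id, anchor };
    }
    case "set_theme":
      return { kind: "theme", themeId: ev.input.themeId };
  }
}
```

Keeping the mapping from tool call to render command this direct is what lets the whole cycle stay under a second: there is no intermediate planning step between the model's decision and the canvas.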
The child speaks naturally. Nova 2 Sonic handles real-time bidirectional voice — true speech-to-speech with no text conversion in the middle. When the model decides to act, it calls tools that directly control 30+ AR effects (cat ears, crowns, fireworks, hand sparkles, screen stickers, and more).
Face, body, and hand tracking run locally via MediaPipe (WASM + GPU). Zero camera frames are uploaded. Hand gestures (wave) and expression-based mood signals are detected on-device and injected as context for the next voice turn — they do not trigger standalone responses without speech. Decorative stickers appear at launch — something playful is happening in the first 3 seconds.
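The expression-to-context injection can be sketched as a formatter that only ever emits hedged hints for the next turn. The labels, threshold, and wording are assumptions:

```typescript
// Sketch of turning an on-device expression signal into context for the
// next voice turn. Labels, threshold, and phrasing are illustrative.

interface MoodHint {
  label: "smiling" | "neutral" | "surprised";
  confidence: number; // 0..1, from the local tracker
}

// Hints are phrased as uncertainty ("seems"), never as fact, matching the
// prompt-level rule that expression signals must not be overclaimed.
function moodToContext(hint: MoodHint | null): string | null {
  if (!hint || hint.confidence < 0.6) return null; // weak signal: inject nothing
  return `[visual hint, may be wrong] The child seems ${hint.label}.`;
}
```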
Why Amazon Nova was essential
| Requirement | Alexa / Siri / Google | Nova 2 Sonic |
|---|---|---|
| Custom tool execution | Limited (Skill/Action) | Native tool use |
| Bidirectional audio streaming | Request-response | Real-time bidirectional |
| Transform on-screen world mid-conversation | Not supported or limited | Direct control via tool events |
| Speech-to-speech (no text intermediary) | Internally text-mediated | True speech-to-speech |
| Robustness to short child utterances | Command-pattern dependent | Contextual inference |
Nova 2 Sonic's bidirectional streaming + native tool use closes the speak → understand → act loop in under 1 second. We could not find another model that does this.
How we built it
To close the loop of speak → understand → act, we rely entirely on Amazon Nova 2 Sonic's bidirectional streaming and native tool use.
Privacy and safety
- Zero camera frames leave the browser — all tracking is local
- Audio streams only for the live session — no recording, no persistence
- Expression signals are hints, never certainty — the prompt prohibits overclaiming
- Bedrock Guardrails (ApplyGuardrail API) are applied to transcripts before model processing
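A minimal sketch of shaping a transcript for that pre-model check, assuming the Bedrock Runtime ApplyGuardrail payload shape; the guardrail ID and version are placeholders:

```typescript
// Sketch of building an ApplyGuardrail request for a transcript before it
// reaches Nova 2 Sonic. Guardrail ID/version are placeholders.

function buildGuardrailRequest(transcript: string) {
  return {
    guardrailIdentifier: "GUARDRAIL_ID", // placeholder
    guardrailVersion: "1",               // placeholder
    source: "INPUT" as const,            // screening user input, not model output
    content: [{ text: { text: transcript } }],
  };
}

// In the real service this object would be passed to ApplyGuardrailCommand
// from @aws-sdk/client-bedrock-runtime; if the response reports that the
// guardrail intervened, the turn is blocked before the model sees it.
```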
Architecture
- Frontend: Runs in any smartphone/tablet browser. Handles mic input (VAD) and camera. MediaPipe face/hand/body tracking and Canvas effect rendering are executed entirely locally.
- Backend (AWS): CloudFront → ALB → ECS Fargate (Node.js). Orchestrates audio and bidirectional WebSocket streaming with Nova 2 Sonic.
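The relay's per-session audio handling can be sketched as an ordered queue with a single in-flight flush; the send callback stands in for the actual Nova 2 Sonic stream writer, which is an assumption here:

```typescript
// Sketch of the Fargate relay's per-session audio queue: browser chunks
// are buffered and flushed to the model stream in arrival order.
// The send callback is a stand-in for the real stream writer.

type SendFn = (chunk: Uint8Array) => Promise<void>;

class AudioRelay {
  private queue: Uint8Array[] = [];
  private flushing = false;

  constructor(private send: SendFn) {}

  // Called for each WebSocket binary message from the browser.
  push(chunk: Uint8Array): void {
    this.queue.push(chunk);
    void this.flush();
  }

  private async flush(): Promise<void> {
    if (this.flushing) return; // one in-flight flush keeps chunks ordered
    this.flushing = true;
    try {
      while (this.queue.length > 0) {
        await this.send(this.queue.shift()!);
      }
    } finally {
      this.flushing = false;
    }
  }
}
```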
Tool design
The model calls 4 explicitly defined tools:
| Tool | Purpose | Example |
|---|---|---|
| apply_effect | Select and execute 30+ visual effects | Cat ears, fireworks, rainbow, hand sparkles |
| set_theme | Switch world theme | Candy Land, Space, Deep Sea |
| bind_alias | Bind a child's invented word to an effect | "sparkleblast" → SPARKLE_BIG |
| get_live_mood | Retrieve local expression hints (injected into next turn) | Smiling → brighten effects |
What the model decided is traceable. What the renderer executed is reproducible. Local-only data (face, hands) never leaks to the cloud.
(Note: We also provide a debug menu in the top-left corner to instantly test all 30+ AR effects with one click.)
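The four tools could be declared roughly as follows, in a JSON-schema style shaped like Bedrock tool specs; the exact field names and descriptions are assumptions:

```typescript
// Illustrative declarations of the four tools. Field names follow a
// generic JSON-schema tool-spec shape, not the exact Sonic Buddy config.

const tools = [
  {
    name: "apply_effect",
    description: "Select and execute one of 30+ visual effects.",
    inputSchema: {
      type: "object",
      properties: { effectId: { type: "string" } },
      required: ["effectId"],
    },
  },
  {
    name: "set_theme",
    description: "Switch the world theme (e.g. Candy Land, Space, Deep Sea).",
    inputSchema: {
      type: "object",
      properties: { themeId: { type: "string" } },
      required: ["themeId"],
    },
  },
  {
    name: "bind_alias",
    description: "Bind a child's invented word to an existing effect.",
    inputSchema: {
      type: "object",
      properties: {
        alias: { type: "string" },
        effectId: { type: "string" },
      },
      required: ["alias", "effectId"],
    },
  },
  {
    name: "get_live_mood",
    description: "Retrieve local expression hints for the next turn.",
    inputSchema: { type: "object", properties: {} },
  },
];
```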
Challenges we ran into
- Safety: We apply Amazon Bedrock Guardrails (ApplyGuardrail API) to transcripts before model processing, catching inappropriate input before it reaches Nova 2 Sonic.
- Child speech is messy. Short fragments, made-up words, mid-sentence topic changes. We tuned the system prompt and endpointing sensitivity to avoid cutting children off.
- MediaPipe + Canvas + WebSocket + audio in one tab pushes performance. We set tracker FPS limits and effect particle caps so browsers would run smoothly without crashing.
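The tracker FPS limit can be sketched as a simple frame gate that drops frames arriving faster than a target rate; the target value below is illustrative:

```typescript
// Sketch of a tracker FPS limiter: a frame is processed only if enough
// time has passed since the last processed frame. Target rate is illustrative.

function makeFrameGate(targetFps: number) {
  const minIntervalMs = 1000 / targetFps;
  let last = -Infinity;
  return (nowMs: number): boolean => {
    if (nowMs - last < minIntervalMs) return false; // drop this frame
    last = nowMs;
    return true;
  };
}
```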
Accomplishments we're proud of
- Executing magic in under a second: We shortened the act-and-response loop enough to satisfy a child's short attention span.
- Connecting Nova's strength to real UX: Using true bidirectional streaming + native tool use rather than just chatting, proving that Nova can actually change the visible world mid-conversation.
- Zero-install safety: Achieved a purely browser-based AR experience where your camera frames never reach the cloud.
- The "Magic Word" feature (bind_alias): The model recognizes when a child invents a magic word and instantly links it to a visual effect. It's a completely new kind of AI play.
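At runtime, bind_alias boils down to a small alias table: the invented word maps to an existing effect ID and resolves on later utterances. The helper names below are illustrative:

```typescript
// Sketch of the runtime alias table behind bind_alias. Helper names are
// illustrative; lookups are case-insensitive so "Sparkleblast" still works.

const aliases = new Map<string, string>();

function bindAlias(word: string, effectId: string): void {
  aliases.set(word.toLowerCase(), effectId);
}

function resolveEffect(word: string): string | null {
  return aliases.get(word.toLowerCase()) ?? null;
}
```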
What we learned
- Sub-second tool execution matters more than tool variety. If the response takes too long, children quickly drift away.
- Starting with something already on screen dramatically reduces blank-screen dropoff.
- bind_alias (letting a child invent a word that triggers an effect) gets the strongest reaction from adults watching.
What's next for Sonic Buddy
Finger-level gesture detection, precise item placement for individual facial features, and seasonal asset packs (e.g., educational classroom themes). A voice-and-gesture-only AR interface is perfect for museums, libraries, and public events where you want zero onboarding. It can easily become a tool that encourages pre-readers and young children to speak up and express themselves.
Built With
- amazon-web-services
- augmented-reality
- bedrock
- next.js
- nova