Sonic Buddy: A Real-Time Voice AR Companion for Children
Inspiration
I currently have a 9-month-old son. While he is still too young to operate a tablet on his own, I often see slightly older children around town holding tablets, silently staring at YouTube videos. While digital devices are convenient for parents, I felt a sense of unease seeing children engaged in purely "passive experiences."
"When my son touches a tablet in a few years, I want him to have an 'active experience' where his words and imagination move the world, rather than just consuming a passive video."
For young children who cannot use a keyboard or mouse, their "voice" is their greatest interface. If the words they spoke were instantly reflected in the space before them like magic, imagine how their eyes would light up.
That hope started this project. We designed it to require no dedicated app installation, so that when my son starts speaking, he can instantly experience magic just by opening a web browser on a familiar smartphone or tablet.
What it does
A five-year-old stands in front of the camera. "Give me cat ears!" she says. Half a second later, Nova replies "Here you go, cat ears!" while fluffy ears appear on her head in the AR view. She laughs. "Now sparkles!" Both hands start glowing. "Fireworks too!" The whole screen erupts.
She never touched a keyboard or a mouse.
This is Sonic Buddy — a real-time AR companion for children in the early elementary age range, built around Nova 2 Sonic's bidirectional voice streaming and native tool use.
- Over 30 visual effects respond to voice commands.
- Beyond voice, intuitive AR interactions let children pinch and drag AR items on the screen, or even "feed" them to their mouth to make them disappear.
- Works instantly in any mobile or tablet browser — no app install required. Zero camera frames leave the browser.
Child: "Give me cat ears!"
↓ audio stream (WebSocket)
Nova 2 Sonic: understands context, selects tool
↓ tool event: apply_effect({ effectId: "ITEM_CAT_EARS" })
Browser: renders cat ears at face-tracked position
↓ audio response stream
Nova: "Here you go, cat ears!" (spoken reply)
This cycle completes in ~0.5–1 second.
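The browser side of this loop can be sketched as a small dispatcher that turns a tool event into a render command. The event shape, effect IDs, and anchoring rules below are illustrative assumptions, not the actual Sonic Buddy schema:

```typescript
// Minimal sketch of a browser-side tool-event dispatcher.
// Event shapes and effect IDs are illustrative, not the real schema.

type ToolEvent =
  | { tool: "apply_effect"; input: { effectId: string } }
  | { tool: "set_theme"; input: { themeId: string } };

type RenderCommand =
  | { kind: "effect"; effectId: string; anchor: "face" | "hands" | "screen" }
  | { kind: "theme"; themeId: string };

// Which effects track the face vs. hands vs. whole screen (illustrative subset).
const FACE_ANCHORED = new Set(["ITEM_CAT_EARS", "ITEM_CROWN"]);
const HAND_ANCHORED = new Set(["SPARKLE_HANDS"]);

function dispatchToolEvent(ev: ToolEvent): RenderCommand {
  switch (ev.tool) {
    case "apply_effect": {
      const id = ev.input.effectId;
      const anchor = FACE_ANCHORED.has(id) ? "face"
        : HAND_ANCHORED.has(id) ? "hands"
        : "screen";
      return { kind: "effect", effectId: id, anchor };
    }
    case "set_theme":
      return { kind: "theme", themeId: ev.input.themeId };
  }
}
```

Keeping the mapping from tool call to render command this direct is what lets the whole cycle stay under a second: there is no intermediate planning step between the model's decision and the canvas.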
The child speaks naturally. Nova 2 Sonic handles real-time bidirectional voice — true speech-to-speech with no text conversion in the middle. When the model decides to act, it calls tools that directly control 30+ AR effects (cat ears, crowns, fireworks, hand sparkles, screen stickers, and more).
Face, body, and hand tracking run locally via MediaPipe (WASM + GPU). Zero camera frames are uploaded. Hand gestures (wave) and expression-based mood signals are detected on-device and injected as context for the next voice turn — they do not trigger standalone responses without speech. Decorative stickers appear at launch — something playful is happening in the first 3 seconds.
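The expression-to-context injection can be sketched as a formatter that only ever emits hedged hints for the next turn. The labels, threshold, and wording are assumptions:

```typescript
// Sketch of turning an on-device expression signal into context for the
// next voice turn. Labels, threshold, and phrasing are illustrative.

interface MoodHint {
  label: "smiling" | "neutral" | "surprised";
  confidence: number; // 0..1, from the local tracker
}

// Hints are phrased as uncertainty ("seems"), never as fact, matching the
// prompt-level rule that expression signals must not be overclaimed.
function moodToContext(hint: MoodHint | null): string | null {
  if (!hint || hint.confidence < 0.6) return null; // weak signal: inject nothing
  return `[visual hint, may be wrong] The child seems ${hint.label}.`;
}
```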
Why Amazon Nova was essential
| Requirement | Alexa / Siri / Google | Nova 2 Sonic |
|---|---|---|
| Custom tool execution | Limited (Skill/Action) | Native tool use |
| Bidirectional audio streaming | Request-response | Real-time bidirectional |
| Transform on-screen world mid-conversation | Not supported or limited | Direct control via tool events |
| Speech-to-speech (no text intermediary) | Internally text-mediated | True speech-to-speech |
| Robustness to short child utterances | Command-pattern dependent | Contextual inference |
Nova 2 Sonic's bidirectional streaming + native tool use closes the speak → understand → act loop in under 1 second. We could not find another model that does this.
How we built it
To close the loop of speak → understand → act, we rely entirely on Amazon Nova 2 Sonic's bidirectional streaming and native tool use.
Privacy and safety
- Zero camera frames leave the browser — all tracking is local
- Audio streams only for the live session — no recording, no persistence
- Expression signals are hints, never certainty — the prompt prohibits overclaiming
- Bedrock Guardrails (ApplyGuardrail API) are applied to transcripts before model processing
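A minimal sketch of shaping a transcript for that pre-model check, assuming the Bedrock Runtime ApplyGuardrail payload shape; the guardrail ID and version are placeholders:

```typescript
// Sketch of building an ApplyGuardrail request for a transcript before it
// reaches Nova 2 Sonic. Guardrail ID/version are placeholders.

function buildGuardrailRequest(transcript: string) {
  return {
    guardrailIdentifier: "GUARDRAIL_ID", // placeholder
    guardrailVersion: "1",               // placeholder
    source: "INPUT" as const,            // screening user input, not model output
    content: [{ text: { text: transcript } }],
  };
}

// In the real service this object would be passed to ApplyGuardrailCommand
// from @aws-sdk/client-bedrock-runtime; if the response reports that the
// guardrail intervened, the turn is blocked before the model sees it.
```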
Architecture
- Frontend: Runs in any smartphone/tablet browser. Handles mic input (VAD) and camera. MediaPipe face/hand/body tracking and Canvas effect rendering are executed entirely locally.
- Backend (AWS): CloudFront → ALB → ECS Fargate (Node.js). Orchestrates audio and bidirectional WebSocket streaming with Nova 2 Sonic.
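The relay's per-session audio handling can be sketched as an ordered queue with a single in-flight flush; the send callback stands in for the actual Nova 2 Sonic stream writer, which is an assumption here:

```typescript
// Sketch of the Fargate relay's per-session audio queue: browser chunks
// are buffered and flushed to the model stream in arrival order.
// The send callback is a stand-in for the real stream writer.

type SendFn = (chunk: Uint8Array) => Promise<void>;

class AudioRelay {
  private queue: Uint8Array[] = [];
  private flushing = false;

  constructor(private send: SendFn) {}

  // Called for each WebSocket binary message from the browser.
  push(chunk: Uint8Array): void {
    this.queue.push(chunk);
    void this.flush();
  }

  private async flush(): Promise<void> {
    if (this.flushing) return; // one in-flight flush keeps chunks ordered
    this.flushing = true;
    try {
      while (this.queue.length > 0) {
        await this.send(this.queue.shift()!);
      }
    } finally {
      this.flushing = false;
    }
  }
}
```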
Tool design
The model calls 4 explicitly defined tools:
| Tool | Purpose | Example |
|---|---|---|
| apply_effect | Select and execute 30+ visual effects | Cat ears, fireworks, rainbow, hand sparkles |
| set_theme | Switch world theme | Candy Land, Space, Deep Sea |
| bind_alias | Bind a child's invented word to an effect | "sparkleblast" → SPARKLE_BIG |
| get_live_mood | Retrieve local expression hints (injected into next turn) | Smiling → brighten effects |
What the model decided is traceable. What the renderer executed is reproducible. Local-only data (face, hands) never leaks to the cloud.
(Note: We also provide a debug menu in the top-left corner to instantly test all 30+ AR effects with one click.)
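The four tools could be declared roughly as follows, in a JSON-schema style shaped like Bedrock tool specs; the exact field names and descriptions are assumptions:

```typescript
// Illustrative declarations of the four tools. Field names follow a
// generic JSON-schema tool-spec shape, not the exact Sonic Buddy config.

const tools = [
  {
    name: "apply_effect",
    description: "Select and execute one of 30+ visual effects.",
    inputSchema: {
      type: "object",
      properties: { effectId: { type: "string" } },
      required: ["effectId"],
    },
  },
  {
    name: "set_theme",
    description: "Switch the world theme (e.g. Candy Land, Space, Deep Sea).",
    inputSchema: {
      type: "object",
      properties: { themeId: { type: "string" } },
      required: ["themeId"],
    },
  },
  {
    name: "bind_alias",
    description: "Bind a child's invented word to an existing effect.",
    inputSchema: {
      type: "object",
      properties: {
        alias: { type: "string" },
        effectId: { type: "string" },
      },
      required: ["alias", "effectId"],
    },
  },
  {
    name: "get_live_mood",
    description: "Retrieve local expression hints for the next turn.",
    inputSchema: { type: "object", properties: {} },
  },
];
```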
Challenges we ran into
- Safety: We apply Amazon Bedrock Guardrails (ApplyGuardrail API) to transcripts before model processing, catching inappropriate input before it reaches Nova 2 Sonic.
- Child speech is messy. Short fragments, made-up words, mid-sentence topic changes. We tuned the system prompt and endpointing sensitivity to avoid cutting children off.
- MediaPipe + Canvas + WebSocket + audio in one tab pushes performance. We set tracker FPS limits and effect particle caps so browsers would run smoothly without crashing.
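The tracker FPS limit can be sketched as a simple frame gate that drops frames arriving faster than a target rate; the target value below is illustrative:

```typescript
// Sketch of a tracker FPS limiter: a frame is processed only if enough
// time has passed since the last processed frame. Target rate is illustrative.

function makeFrameGate(targetFps: number) {
  const minIntervalMs = 1000 / targetFps;
  let last = -Infinity;
  return (nowMs: number): boolean => {
    if (nowMs - last < minIntervalMs) return false; // drop this frame
    last = nowMs;
    return true;
  };
}
```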
Accomplishments we're proud of
- Executing magic in under a second: We shortened the act-and-response loop enough to satisfy a child's short attention span.
- Connecting Nova's strength to real UX: Using true bidirectional streaming + native tool use rather than just chatting, proving that Nova can actually change the visible world mid-conversation.
- Zero-install safety: Achieved a purely browser-based AR experience where your camera frames never reach the cloud.
- The "Magic Word" feature (bind_alias): The model recognizes when a child invents a magic word and instantly links it to a visual effect. It's a completely new kind of AI play.
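At runtime, bind_alias boils down to a small alias table: the invented word maps to an existing effect ID and resolves on later utterances. The helper names below are illustrative:

```typescript
// Sketch of the runtime alias table behind bind_alias. Helper names are
// illustrative; lookups are case-insensitive so "Sparkleblast" still works.

const aliases = new Map<string, string>();

function bindAlias(word: string, effectId: string): void {
  aliases.set(word.toLowerCase(), effectId);
}

function resolveEffect(word: string): string | null {
  return aliases.get(word.toLowerCase()) ?? null;
}
```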
What we learned
- Sub-second tool execution matters more than tool variety. If the response takes too long, children quickly drift away.
- Starting with something already on screen dramatically reduces blank-screen dropoff.
- bind_alias (letting a child invent a word that triggers an effect) gets the strongest reaction from adults watching.
What's next for Sonic Buddy
Finger-level gesture detection, precise item placement for individual facial features, and seasonal asset packs (e.g., educational classroom themes). A voice-and-gesture-only AR interface is perfect for museums, libraries, and public events where you want zero onboarding. It can easily become a tool that encourages pre-readers and young children to speak up and express themselves.
Built With
- amazon-web-services
- augmented-reality
- bedrock
- next.js
- nova