My Daughter's Doodles Came to Life: Building an AI Tutor with Gemini & SAM 2

Inspiration

My daughter just started first grade here in Japan. It’s a huge milestone—the shiny new randoseru (backpack), the excitement of new friends, but also the daunting reality of homework. Watching her sit at her desk, I noticed two things: she loved drawing, but she found studying to be a chore. She would doodle monsters and princesses on the corners of her math worksheets instead of solving the problems.

That’s when it hit me. What if those doodles could help her study? As a developer (and a mom), I decided to build a solution to bridge the gap between her creativity and her education. I wanted to bring her drawings to life, not just as animations, but as intelligent, interactive study partners.

What it does

AmiBuddy is an AI-powered application that transforms static children's drawings into "Living AI Agents."

Animate Drawings: It turns a static sketch into a rigged character that "breathes" and moves using sine-wave animations.

Homework Analysis: It uses Gemini Vision to analyze homework sheets, understanding the intent, topics (like Addition or Kanji), and difficulty level.

Voice-First Tutor: It acts as a friendly study buddy, allowing the child to ask questions and receive encouraging, short explanations via ElevenLabs voices.
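To give a flavor of the "breathing" idle motion described above, here is a minimal sketch of a sine-wave animation driver. The function name, parameter defaults, and transform keys are illustrative, not the actual app code:

```python
import math

def breathing_transform(t: float, rate_hz: float = 0.4, amplitude: float = 0.03) -> dict:
    """Per-frame transforms for a 'breathing' idle animation.

    t         -- elapsed time in seconds
    rate_hz   -- breath cycles per second (hypothetical default)
    amplitude -- how far the torso scales from rest (3% here)
    """
    phase = 2 * math.pi * rate_hz * t
    return {
        # torso gently scales up and down around 1.0
        "torso_scale": 1.0 + amplitude * math.sin(phase),
        # arms sway slightly, a quarter cycle out of phase so the motion feels organic
        "arm_rotation_deg": 2.0 * math.sin(phase + math.pi / 2),
    }
```

Evaluating this each frame and applying the results to the rigged limbs is enough to make a static sketch feel alive.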

How we built it

I built AmiBuddy using a hybrid architecture:

Frontend Layer: React Native (Expo) for both mobile and web clients to ensure a smooth user experience.

Backend (Cloud Run): A Python (FastAPI) orchestrator that manages heavy AI processing.

Rigging Agent: Gemini 2.5 Flash performs "zero-shot" structural understanding to find joints and body parts (head, arms, legs) without needing a custom-trained model.

Segmentation Agent: Meta SAM 2 (Segment Anything Model) takes the bounding boxes from Gemini to extract pixel-perfect alpha masks for each limb.

Voice & Logic: Gemini 2.5 Flash handles the homework analysis and speech-to-text, while ElevenLabs provides the character's vocal identity.
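The hand-off between the Rigging Agent and the Segmentation Agent can be sketched as a small conversion step: Gemini's part detections (assuming its usual normalized 0–1000 coordinate convention and a hypothetical JSON shape) become per-limb box prompts in pixel coordinates for SAM 2:

```python
def rig_to_box_prompts(rig: dict, width: int, height: int) -> dict:
    """Convert Gemini rigging output into per-part box prompts for SAM 2.

    Assumed rig shape (hypothetical, not the real schema):
      {"parts": [{"name": "head", "box": [y0, x0, y1, x1]}, ...]}
    with coordinates normalized to 0-1000, Gemini's usual convention.
    Returns {part_name: [x0, y0, x1, y1]} in pixels (XYXY), the box
    format SAM-style predictors expect.
    """
    prompts = {}
    for part in rig["parts"]:
        y0, x0, y1, x1 = part["box"]
        prompts[part["name"]] = [
            int(x0 / 1000 * width),
            int(y0 / 1000 * height),
            int(x1 / 1000 * width),
            int(y1 / 1000 * height),
        ]
    return prompts
```

Each resulting box is then fed to SAM 2 as a prompt to extract that limb's alpha mask.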

Challenges we ran into

The biggest technical hurdle was ensuring the "surgical extraction" of the limbs didn't crop off hands or feet. I had to engineer the rigging prompt to be "generous" with bounding boxes and implement padding logic in the backend so SAM 2 always received the whole limb in context.
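The padding step amounts to expanding each box by a margin and clamping it to the image bounds. A rough sketch (the function name, box format, and pad ratio are assumptions, not the real implementation):

```python
def pad_box(box: list, pad_ratio: float, width: int, height: int) -> list:
    """Expand an [x0, y0, x1, y1] pixel box by pad_ratio on each side,
    clamped to the image bounds, so hands and feet at the box edge
    still fall inside the region handed to SAM 2."""
    x0, y0, x1, y1 = box
    pad_x = int((x1 - x0) * pad_ratio)
    pad_y = int((y1 - y0) * pad_ratio)
    return [
        max(0, x0 - pad_x),
        max(0, y0 - pad_y),
        min(width, x1 + pad_x),
        min(height, y1 + pad_y),
    ]
```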

Additionally, handling mixed languages (Japanese/English) for voice recognition was a challenge that Gemini 2.5 Flash handled significantly better than traditional STT engines.

Accomplishments that we're proud of

I'm proud of the "Living Agent" pipeline, specifically the hand-off between Gemini's reasoning and SAM 2's precision. For my daughter, AmiBuddy turned homework from a struggle into play. She's now excited to show her "buddy" what she learned today. Seeing the character "come to life" in a few seconds on a mobile screen feels like real magic.

What we learned

I learned that Gemini has incredible "zero-shot" understanding for skeletal animation compared to traditional object detection models. I also learned that for a 6-year-old, typing is a barrier; making the application voice-first was essential to making the technology accessible and engaging for her age group.

What's next for AmiBuddy

I'm open-sourcing parts of this journey to help other parents and developers build their own magic. Future steps include adding "Memory" so the buddy can remember previous study sessions and developing a feature where the AI can "draw back" to explain math concepts visually on the screen.

📺 Watch the Demo: https://youtu.be/PFbhIW4gZxs

🚀 Try the App: https://amibuddy-frontend-535548706733.asia-northeast1.run.app/
