KidsChat is a local-first prototype web app for demonstrating AI to children in a playful, voice-friendly way. It combines a browser chat UI, local Ollama-hosted models, speech input/output, a talking-head avatar, and a small set of fun tools like pictures, sounds, jokes, math, and weather.
This is a prototype/demo application.
- It is intended to demo AI to children in a fun setting with an adult present.
- It has not been tested, safety-reviewed, or hardened for extensive 1-on-1 interaction with children.
- It should not be treated as a child-safety product, tutoring product, or unsupervised companion app.
- Tool outputs come from local models plus live third-party data sources and can still be wrong, awkward, or inappropriate in edge cases.
If you use this with children, do it with adult supervision.
- Text input plus push-to-talk microphone input
- Local speech-to-text with `faster-whisper`
- Local LLM responses through Ollama
- Optional cloud escalation to Claude, OpenAI, or Gemini when API keys are configured
- Browser-rendered talking-head avatar with HeadTTS + TalkingHead
- Local/server fallback TTS with Piper or macOS `say`
- Still-photo camera capture for asking a vision-capable local model about an image
- Tool calling for:
- image search
- sound search/playback
- kid-friendly jokes and facts
- weather
- math
- diagrams
- simple SVG drawings
- Python 3.11+
- FastAPI
- WebSockets
- Ollama Python client
- `faster-whisper` for STT
- Piper TTS and macOS `say` fallback
- Optional NeMo text normalization
- Optional Misaki + phonemizer/eSpeak phonetic preprocessing for HeadTTS
- Plain HTML, CSS, and vanilla JavaScript
- Mermaid for diagrams
- HeadTTS for browser-side speech
- TalkingHead + Three.js for the avatar
- Ollama for local model hosting
- Open-Meteo for weather
- Pexels / Unsplash / Openverse for images
- Freesound / Openverse audio for sound clips
- Optional Claude / OpenAI / Gemini cloud fallback
- The browser sends text or recorded audio to the backend over a WebSocket.
- For photo questions, the browser can capture one still image and send it with a prompt over the same WebSocket.
- Audio is transcribed locally with `faster-whisper`.
- The orchestrator sends the conversation to a local Ollama model first.
- If the model wants tools, the backend runs them and sends structured results back to the UI.
- If the local model cannot handle the prompt well enough, the app can escalate to a configured cloud provider.
- The backend emits:
- visible chat text
- structured media events for images, sounds, diagrams, and SVG
- a separate speech-only text path for TTS
- The frontend renders chat/media and, when available, uses the talking head to speak with browser-side TTS.
- If browser-side avatar speech is unavailable, the backend falls back to Piper or `say`.
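The three output paths above (visible chat text, structured media events, speech-only text) can be sketched as a small event dispatcher. The JSON schema here, a `type` field of `chat`, `media`, or `speech`, is an assumption for illustration only, not the app's actual wire format.

```python
import json

# Hypothetical event schema: each backend WebSocket message is JSON with
# a "type" field plus a payload. The real KidsChat protocol may differ;
# this only illustrates routing the three output paths to handlers.
def route_event(raw: str) -> str:
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "chat":
        return f"render text: {event['text']}"
    if kind == "media":
        return f"render {event['media_kind']} card: {event['url']}"
    if kind == "speech":
        return f"speak via TTS: {event['text']}"
    return "ignore unknown event"
```

Keeping the speech path as a separate event type is what lets the spoken text be cleaner than the on-screen text.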
- Conversational chat with a kid-friendly system prompt
- Press-and-hold microphone button for voice input
- Separate display text vs speech text path so spoken output can be cleaner than on-screen text
- Server-side speech cleanup and normalization for units, punctuation, markdown, and UI-specific phrases
- Optional server-side phonetic generation for better HeadTTS pronunciation
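The speech cleanup step can be sketched as a few substitutions. This is an illustrative toy, not the real `speech_normalizer` service, which also covers NeMo-based normalization and UI-specific phrases.

```python
import re

# Illustrative cleanup only: strip markdown markers, expand one unit
# abbreviation, and collapse whitespace before the text reaches TTS.
def clean_for_speech(text: str) -> str:
    text = re.sub(r"[*_`#]", "", text)                        # markdown markers
    text = re.sub(r"\b(\d+)\s?km\b", r"\1 kilometers", text)  # expand units
    return re.sub(r"\s+", " ", text).strip()                  # tidy whitespace
```

A real normalizer would table-drive the unit expansions rather than hard-code them.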
- `search_images`: shows image cards in the chat UI
- `play_sound`: shows an inline audio player for sound clips
- `create_diagram`: creates Mermaid diagrams for explicit chart/flow/cycle requests
- `draw_picture`: creates sanitized inline SVG drawings
- `do_math`: solves simple math expressions
- `get_weather`: fetches current weather
- `tell_joke`: returns kid-friendly jokes/riddles
- `fun_fact`: returns fun facts by topic
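A name-to-function registry is one common way to wire such tools up. The decorator API below is an assumption loosely modeled on the idea behind `backend/tools/registry.py`, not its actual interface; `do_math` is shown with a safe arithmetic evaluator rather than `eval()`.

```python
import ast
import operator

TOOLS = {}  # tool name -> callable, filled by the decorator below

def register(name):
    """Register a tool under the name the model calls it by."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

@register("do_math")
def do_math(expression: str) -> float:
    """Evaluate +, -, *, / expressions without eval()."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def run_tool(name: str, **kwargs):
    return TOOLS[name](**kwargs)
```

The backend would call `run_tool` with the name and arguments the model emits, then send the structured result back to the UI.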
- Browser camera button for capturing one still image
- Sends the captured photo to a vision-capable local model such as Gemma 4
- Supports child-friendly scene description such as visible objects, people counts, clothing colors, and simple activity descriptions
- Explicitly avoids person identification and sensitive-trait guessing
- Current flow is single-image, single-turn oriented rather than persistent visual memory across a long conversation
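The single-image, single-turn flow amounts to building one user message that carries both the prompt and the photo. The helper below is a sketch: the prompt framing is a placeholder, and the base64-encoded `images` field mirrors the shape the Ollama Python client's `chat()` accepts for vision models.

```python
import base64

# Builds a one-shot vision message; the prompt text here is a
# placeholder, not KidsChat's actual child-friendly prompt.
def vision_message(prompt: str, image_bytes: bytes) -> dict:
    return {
        "role": "user",
        "content": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }

# Usage against a running Ollama server (model name from OLLAMA_MODEL):
# reply = ollama.chat(model="gemma4:31b",
#                     messages=[vision_message("What do you see?", photo)])
```

Because each request carries exactly one image and no prior visual context, there is no persistent visual memory to manage.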
- Browser-side talking-head panel with a local avatar asset
- Default local avatar: `frontend/static/avatars/julia.glb`
- Configurable Kokoro/HeadTTS voice and TalkingHead avatar selection
- Browser-side speech prefers phonetic input when available
The app is local-model-first. `OLLAMA_MODEL` controls which local model is used.
Examples:
- `gpt-oss:20b`
- `gemma4:31b`
- `qwen3:30b`
The local adapter has model-family-specific handling for some families, such as:
- prompt shaping
- response cleanup
- model-specific Ollama sampling options
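Per-family sampling options can be selected from the model name. The option values below are hypothetical; the real ones live in `backend/services/llm_local.py` and will differ.

```python
# Hypothetical per-family Ollama sampling options for illustration.
FAMILY_OPTIONS = {
    "gemma": {"temperature": 0.7, "top_p": 0.9},
    "qwen": {"temperature": 0.6, "repeat_penalty": 1.1},
}

def options_for(model_name: str) -> dict:
    # "gemma4:31b" -> "gemma4" -> "gemma"; unknown families get defaults.
    family = model_name.split(":")[0].rstrip("0123456789")
    return FAMILY_OPTIONS.get(family, {"temperature": 0.8})
```

The returned dict would be passed as the `options` argument to the Ollama client's chat call.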
Gemma 4 models are also used for the current still-photo vision flow. That path is intentionally simple:
- one captured image per request
- local-model-only
- no face recognition or person identification
- best suited to scene description, counting, and visible object/clothing details
```
kidschat/
├── backend/
│   ├── app.py
│   ├── orchestrator.py
│   ├── services/
│   │   ├── llm_local.py
│   │   ├── llm_cloud.py
│   │   ├── stt.py
│   │   ├── tts.py
│   │   ├── speech_normalizer.py
│   │   └── speech_phonemizer.py
│   └── tools/
│       ├── registry.py
│       ├── search.py
│       ├── sound.py
│       ├── picture.py
│       ├── diagram.py
│       └── fun.py
├── frontend/
│   ├── templates/
│   │   └── index.html
│   └── static/
│       ├── avatars/
│       ├── css/
│       └── js/
├── config/
│   └── env.example
├── tests/
├── requirements.txt
└── README.md
```
Using conda:

```bash
conda create -n kidschat python=3.11 -y
conda activate kidschat
pip install -r requirements.txt
```

Or with a venv:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy the example environment file:

```bash
cp config/env.example .env
```

Edit `.env` as needed. The most important setting is:

```
OLLAMA_MODEL=gpt-oss:20b
```

Other common examples:

```
OLLAMA_MODEL=gemma4:31b
TTS_ENGINE=auto
TALKING_HEAD_CHARACTER=julia
HEADTTS_INPUT_MODE=auto
PEXELS_API_KEY=...
FREESOUND_API_KEY=...
```

Install Ollama, make sure it is running, and pull the model you want to use. Examples:

```bash
ollama pull gpt-oss:20b
ollama pull gemma4:31b
```

Start the server:

```bash
python -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8000
```

Open:

http://localhost:8000
See config/env.example for the current full set of options.
Important groups:
- local model: `OLLAMA_MODEL`, `OLLAMA_HOST`
- media search: `PEXELS_API_KEY`, `UNSPLASH_ACCESS_KEY`, `FREESOUND_API_KEY`
- STT: `WHISPER_MODEL`
- server TTS: `TTS_ENGINE`, `PIPER_VOICE`
- browser talking head: `HEADTTS_VOICE`, `HEADTTS_LANGUAGE`, `HEADTTS_DICTIONARY_URL`, `HEADTTS_INPUT_MODE`
- avatar: `TALKING_HEAD_CHARACTER`, `TALKING_HEAD_AVATAR_URL`, `TALKING_HEAD_BODY`
- speech preprocessing: `SPEECH_NORMALIZER`, `HEADTTS_PHONEMIZER_USE_ESPEAK`
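These variables are typically read once at startup with sensible fallbacks. The defaults below are illustrative guesses; `config/env.example` is the authoritative list.

```python
import os

# Illustrative settings loader; default values are assumptions,
# not necessarily what the app actually falls back to.
def load_settings() -> dict:
    return {
        "ollama_model": os.getenv("OLLAMA_MODEL", "gpt-oss:20b"),
        "ollama_host": os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"),
        "tts_engine": os.getenv("TTS_ENGINE", "auto"),
        "talking_head_character": os.getenv("TALKING_HEAD_CHARACTER", "julia"),
    }
```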
- Desktop Chrome or Edge currently gives the best talking-head / browser TTS experience.
- The app still works without the avatar path, but falls back to backend audio.
- Microphone access must be granted in the browser.
- Camera access must be granted in the browser to use photo questions.
Pytest tests cover the main backend paths and selected frontend-adjacent behavior.
Run:

```bash
python -m pytest -q
```

This project is licensed under the Apache License 2.0. See `LICENSE`.
- Not a production deployment
- Not a child-safety moderation system
- Not an educational accuracy guarantee
- Not hardened against prompt attacks, persistent misuse, or determined abuse
- Not tuned for long unsupervised sessions
- Browser avatar/TTS path depends on modern desktop browser support
- Live media/data providers can fail, rate-limit, or return imperfect results
- Better multimodal support for local models with vision towers
- More deliberate kid-safe guardrails and supervision UX
- Better curated tool/data backends for children
- More avatars and voice choices
- More polished activity/animation around speech and listening