KidsChat is a local-first prototype web app for demonstrating AI to children in a playful, voice-friendly way. It combines a browser chat UI, local Ollama-hosted models, speech input/output, a talking-head avatar, and a small set of fun tools like pictures, sounds, jokes, math, and weather.
This is a prototype/demo application.
- It is intended to demo AI to children in a fun setting with an adult present.
- It has not been tested, safety-reviewed, or hardened for extensive 1-on-1 interaction with children.
- It should not be treated as a child-safety product, tutoring product, or unsupervised companion app.
- Tool outputs come from local models plus live third-party data sources and can still be wrong, awkward, or inappropriate in edge cases.
If you use this with children, do it with adult supervision.
- Text input plus push-to-talk microphone input
- Local speech-to-text with `faster-whisper`
- Local LLM responses through Ollama
- Optional cloud escalation to Claude, OpenAI, or Gemini when API keys are configured
- Browser-rendered talking-head avatar with HeadTTS + TalkingHead
- Local/server fallback TTS with Piper or macOS `say`
- Still-photo camera capture for asking a vision-capable local model about an image
- Tool calling for:
- image search
- sound search/playback
- kid-friendly jokes and facts
- weather
- math
- diagrams
- simple SVG drawings
- Python 3.11+
- FastAPI
- WebSockets
- Ollama Python client
- `faster-whisper` for STT
- Piper TTS and macOS `say` fallback
- Optional NeMo text normalization
- Optional Misaki + phonemizer/eSpeak phonetic preprocessing for HeadTTS
- Plain HTML, CSS, and vanilla JavaScript
- Mermaid for diagrams
- HeadTTS for browser-side speech
- TalkingHead + Three.js for the avatar
- Ollama for local model hosting
- Open-Meteo for weather
- Pexels / Unsplash / Openverse for images
- Freesound / Openverse audio for sound clips
- Optional Claude / OpenAI / Gemini cloud fallback
- The browser sends text or recorded audio to the backend over a WebSocket.
- For photo questions, the browser can capture one still image and send it with a prompt over the same WebSocket.
- Audio is transcribed locally with `faster-whisper`.
- The orchestrator sends the conversation to a local Ollama model first.
- If the model wants tools, the backend runs them and sends structured results back to the UI.
- If the local model cannot handle the prompt well enough, the app can escalate to a configured cloud provider.
- The backend emits:
- visible chat text
- structured media events for images, sounds, diagrams, and SVG
- a separate speech-only text path for TTS
- The frontend renders chat/media and, when available, uses the talking head to speak with browser-side TTS.
- If browser-side avatar speech is unavailable, the backend falls back to Piper or `say`.
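The three output paths above (visible chat text, structured media events, speech-only text) can be sketched as a small event dispatcher. The JSON schema here, a `type` field of `chat`, `media`, or `speech`, is an assumption for illustration only, not the app's actual wire format.

```python
import json

# Hypothetical event schema: each backend WebSocket message is JSON with
# a "type" field plus a payload. The real KidsChat protocol may differ;
# this only illustrates routing the three output paths to handlers.
def route_event(raw: str) -> str:
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "chat":
        return f"render text: {event['text']}"
    if kind == "media":
        return f"render {event['media_kind']} card: {event['url']}"
    if kind == "speech":
        return f"speak via TTS: {event['text']}"
    return "ignore unknown event"
```

Keeping the speech path as a separate event type is what lets the spoken text be cleaner than the on-screen text.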
- Conversational chat with a kid-friendly system prompt
- Press-and-hold microphone button for voice input
- Separate display text vs speech text path so spoken output can be cleaner than on-screen text
- Server-side speech cleanup and normalization for units, punctuation, markdown, and UI-specific phrases
- Optional server-side phonetic generation for better HeadTTS pronunciation
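The speech cleanup step can be sketched as a few substitutions. This is an illustrative toy, not the real `speech_normalizer` service, which also covers NeMo-based normalization and UI-specific phrases.

```python
import re

# Illustrative cleanup only: strip markdown markers, expand one unit
# abbreviation, and collapse whitespace before the text reaches TTS.
def clean_for_speech(text: str) -> str:
    text = re.sub(r"[*_`#]", "", text)                        # markdown markers
    text = re.sub(r"\b(\d+)\s?km\b", r"\1 kilometers", text)  # expand units
    return re.sub(r"\s+", " ", text).strip()                  # tidy whitespace
```

A real normalizer would table-drive the unit expansions rather than hard-code them.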
- `search_images`: shows image cards in the chat UI
- `play_sound`: shows an inline audio player for sound clips
- `create_diagram`: creates Mermaid diagrams for explicit chart/flow/cycle requests
- `draw_picture`: creates sanitized inline SVG drawings
- `do_math`: solves simple math expressions
- `get_weather`: fetches current weather
- `tell_joke`: returns kid-friendly jokes/riddles
- `fun_fact`: returns fun facts by topic
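A name-to-function registry is one common way to wire such tools up. The decorator API below is an assumption loosely modeled on the idea behind `backend/tools/registry.py`, not its actual interface; `do_math` is shown with a safe arithmetic evaluator rather than `eval()`.

```python
import ast
import operator

TOOLS = {}  # tool name -> callable, filled by the decorator below

def register(name):
    """Register a tool under the name the model calls it by."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

@register("do_math")
def do_math(expression: str) -> float:
    """Evaluate +, -, *, / expressions without eval()."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def run_tool(name: str, **kwargs):
    return TOOLS[name](**kwargs)
```

The backend would call `run_tool` with the name and arguments the model emits, then send the structured result back to the UI.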
- Browser camera button for capturing one still image
- Sends the captured photo to a vision-capable local model such as Gemma 4
- Supports child-friendly scene description such as visible objects, people counts, clothing colors, and simple activity descriptions
- Explicitly avoids person identification and sensitive-trait guessing
- Current flow is single-image, single-turn oriented rather than persistent visual memory across a long conversation
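The single-image, single-turn flow amounts to building one user message that carries both the prompt and the photo. The helper below is a sketch: the prompt framing is a placeholder, and the base64-encoded `images` field mirrors the shape the Ollama Python client's `chat()` accepts for vision models.

```python
import base64

# Builds a one-shot vision message; the prompt text here is a
# placeholder, not KidsChat's actual child-friendly prompt.
def vision_message(prompt: str, image_bytes: bytes) -> dict:
    return {
        "role": "user",
        "content": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }

# Usage against a running Ollama server (model name from OLLAMA_MODEL):
# reply = ollama.chat(model="gemma4:31b",
#                     messages=[vision_message("What do you see?", photo)])
```

Because each request carries exactly one image and no prior visual context, there is no persistent visual memory to manage.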
- Browser-side talking-head panel with a local avatar asset
- Default local avatar: `frontend/static/avatars/julia.glb`
- Configurable Kokoro/HeadTTS voice and TalkingHead avatar selection
- Browser-side speech prefers phonetic input when available
The app is local-model-first. `OLLAMA_MODEL` controls which local model is used.
Examples:
- `gpt-oss:20b`
- `gemma4:31b`
- `qwen3:30b`
The local adapter has model-family-specific handling for some families, such as:
- prompt shaping
- response cleanup
- model-specific Ollama sampling options
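Per-family sampling options can be selected from the model name. The option values below are hypothetical; the real ones live in `backend/services/llm_local.py` and will differ.

```python
# Hypothetical per-family Ollama sampling options for illustration.
FAMILY_OPTIONS = {
    "gemma": {"temperature": 0.7, "top_p": 0.9},
    "qwen": {"temperature": 0.6, "repeat_penalty": 1.1},
}

def options_for(model_name: str) -> dict:
    # "gemma4:31b" -> "gemma4" -> "gemma"; unknown families get defaults.
    family = model_name.split(":")[0].rstrip("0123456789")
    return FAMILY_OPTIONS.get(family, {"temperature": 0.8})
```

The returned dict would be passed as the `options` argument to the Ollama client's chat call.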
Gemma 4 models are also used for the current still-photo vision flow. That path is intentionally simple:
- one captured image per request
- local-model-only
- no face recognition or person identification
- best suited to scene description, counting, and visible object/clothing details
```
kidschat/
├── backend/
│   ├── app.py
│   ├── orchestrator.py
│   ├── services/
│   │   ├── llm_local.py
│   │   ├── llm_cloud.py
│   │   ├── stt.py
│   │   ├── tts.py
│   │   ├── speech_normalizer.py
│   │   └── speech_phonemizer.py
│   └── tools/
│       ├── registry.py
│       ├── search.py
│       ├── sound.py
│       ├── picture.py
│       ├── diagram.py
│       └── fun.py
├── frontend/
│   ├── templates/
│   │   └── index.html
│   └── static/
│       ├── avatars/
│       ├── css/
│       └── js/
├── config/
│   └── env.example
├── tests/
├── requirements.txt
└── README.md
```
Using conda:

```bash
conda create -n kidschat python=3.11 -y
conda activate kidschat
pip install -r requirements.txt
```

Or with a venv:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy the example environment file:

```bash
cp config/env.example .env
```

Edit `.env` as needed. The most important setting is:

```
OLLAMA_MODEL=gpt-oss:20b
```

Other common examples:

```
OLLAMA_MODEL=gemma4:31b
TTS_ENGINE=auto
TALKING_HEAD_CHARACTER=julia
HEADTTS_INPUT_MODE=auto
PEXELS_API_KEY=...
FREESOUND_API_KEY=...
```

Install Ollama, make sure it is running, and pull the model you want to use. Examples:

```bash
ollama pull gpt-oss:20b
ollama pull gemma4:31b
```

Start the server:

```bash
python -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8000
```

Open:

http://localhost:8000
See config/env.example for the current full set of options.
Important groups:
- local model: `OLLAMA_MODEL`, `OLLAMA_HOST`
- media search: `PEXELS_API_KEY`, `UNSPLASH_ACCESS_KEY`, `FREESOUND_API_KEY`
- STT: `WHISPER_MODEL`
- server TTS: `TTS_ENGINE`, `PIPER_VOICE`
- browser talking head: `HEADTTS_VOICE`, `HEADTTS_LANGUAGE`, `HEADTTS_DICTIONARY_URL`, `HEADTTS_INPUT_MODE`
- avatar: `TALKING_HEAD_CHARACTER`, `TALKING_HEAD_AVATAR_URL`, `TALKING_HEAD_BODY`
- speech preprocessing: `SPEECH_NORMALIZER`, `HEADTTS_PHONEMIZER_USE_ESPEAK`
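These variables are typically read once at startup with sensible fallbacks. The defaults below are illustrative guesses; `config/env.example` is the authoritative list.

```python
import os

# Illustrative settings loader; default values are assumptions,
# not necessarily what the app actually falls back to.
def load_settings() -> dict:
    return {
        "ollama_model": os.getenv("OLLAMA_MODEL", "gpt-oss:20b"),
        "ollama_host": os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"),
        "tts_engine": os.getenv("TTS_ENGINE", "auto"),
        "talking_head_character": os.getenv("TALKING_HEAD_CHARACTER", "julia"),
    }
```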
- Desktop Chrome or Edge currently gives the best talking-head / browser TTS experience.
- The app still works without the avatar path, but falls back to backend audio.
- Microphone access must be granted in the browser.
- Camera access must be granted in the browser to use photo questions.
Pytest tests cover the main backend paths and selected frontend-adjacent behavior.
Run:

```bash
python -m pytest -q
```

This project is licensed under the Apache License 2.0. See `LICENSE`.
- Not a production deployment
- Not a child-safety moderation system
- Not an educational accuracy guarantee
- Not hardened against prompt attacks, persistent misuse, or determined abuse
- Not tuned for long unsupervised sessions
- Browser avatar/TTS path depends on modern desktop browser support
- Live media/data providers can fail, rate-limit, or return imperfect results
- Better multimodal support for local models with vision towers
- More deliberate kid-safe guardrails and supervision UX
- Better curated tool/data backends for children
- More avatars and voice choices
- More polished activity/animation around speech and listening