v0.1: baseline single-transcript behavior.

v0.2 (current): per-person transcript isolation, per-person bubble colors, eye-safe bubble placement, and speaker assignment that combines voiceprint (F1/F2/F3-style spectral peaks) with lip movement as a secondary check.
Camera + face detection + voice detection + speech-to-text, with transcripts shown in a bubble above your head.
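The v0.2 speaker-assignment rule can be sketched as a weighted score: voiceprint distance over F1/F2/F3-style spectral peaks is the primary signal, with lip movement as a secondary check. The weights, normalization, and `assignSpeaker` helper below are illustrative assumptions, not the app's actual implementation:

```javascript
// Sketch: combine voiceprint similarity (primary) with lip movement (secondary).
// Weights and data shapes are illustrative assumptions, not the app's real code.

// Euclidean distance between two formant vectors [F1, F2, F3] in Hz.
function formantDistance(a, b) {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

// people: [{ id, formants: [F1, F2, F3], lipMovement: 0..1 }]
// observedFormants: formant peaks measured from the current audio chunk.
function assignSpeaker(people, observedFormants) {
  let best = null;
  let bestScore = -Infinity;
  for (const p of people) {
    // Closer formants → higher score; scale distance into roughly 0..1.
    const voiceScore = 1 / (1 + formantDistance(p.formants, observedFormants) / 100);
    // Voiceprint dominates; lip movement only breaks close calls.
    const score = 0.7 * voiceScore + 0.3 * p.lipMovement;
    if (score > bestScore) {
      bestScore = score;
      best = p.id;
    }
  }
  return best;
}
```

Weighting the voiceprint heavily means a silent face with residual lip motion cannot steal a line from a strong spectral match.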
- Product prompt: PROMPT.md – what the app should do.
- Roadmap: ROADMAP.md – phases and MVP steps.
- Frontend – use a local server (required for camera/mic and face-api models):

  ```
  npx serve .
  ```

  Or:

  ```
  python -m http.server 8080
  ```

- Backend (required for Whisper transcription + AI language detection/translation):
  ```
  cd server
  npm install
  # Put your key in server/.env:
  # OPENAI_API_KEY=sk-...
  # Optional for speaker diarization (pyannote):
  # HF_TOKEN=hf_...
  # (or use environment variables)
  set OPENAI_API_KEY=sk-...        # Windows
  set HF_TOKEN=hf_...              # Windows (optional)
  # export OPENAI_API_KEY=sk-...   # Mac/Linux
  # export HF_TOKEN=hf_...         # Mac/Linux (optional)
  npm start
  ```

The API runs on `http://localhost:3001`. With `OPENAI_API_KEY` set, the app sends audio chunks to the backend: it transcribes with Whisper, detects the language, and if the speech is not English returns an English translation.
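The "detect, then translate only if not English" decision can be illustrated with a small helper. This is a hedged sketch of the described flow; the function name and the actual logic in server.js may differ:

```javascript
// Sketch: decide whether a detected language needs translation to English.
// Illustrative helper mirroring the described flow, not the real server.js code.
function needsTranslation(detectedLang, targetLang = "en") {
  if (!detectedLang) return false; // unknown language → leave the text as-is
  // Compare base language codes, so "en-US" still counts as English.
  return detectedLang.toLowerCase().split("-")[0] !== targetLang;
}
```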
If `HF_TOKEN` is set, the backend also calls Hugging Face diarization and returns a speaker-label hint to improve two-person separation.
The default primary model endpoint is `pyannote/speaker-diarization-3.1` and the fallback is `hicustomer/pyannote-speaker-diarization` (configurable with `HF_DIARIZATION_URL` and `HF_DIARIZATION_FALLBACK_URL`).
This uses the same flow as:

```python
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=...)
diarization = pipeline("audio.wav", num_speakers=2)
```
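One way to turn diarization output into the speaker-label hint is to pick whoever talks most within the audio chunk. A sketch in JavaScript: the segment shape `{ start, end, speaker }` follows pyannote's speaker turns conceptually, but the helper and its wiring are assumptions, not the backend's actual code:

```javascript
// Sketch: pick the dominant speaker among diarization segments overlapping a chunk.
// Segment shape { start, end, speaker } is modeled on pyannote's speaker turns;
// this helper is illustrative, not the backend's real implementation.
function dominantSpeaker(segments, chunkStart, chunkEnd) {
  const talkTime = {};
  for (const { start, end, speaker } of segments) {
    // Overlap of [start, end] with [chunkStart, chunkEnd], in seconds.
    const overlap = Math.min(end, chunkEnd) - Math.max(start, chunkStart);
    if (overlap > 0) talkTime[speaker] = (talkTime[speaker] || 0) + overlap;
  }
  let best = null;
  for (const [speaker, t] of Object.entries(talkTime)) {
    if (best === null || t > talkTime[best]) best = speaker;
  }
  return best; // null if no segment overlaps the chunk
}
```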
Setup:

```
cd server
pip install -r requirements-pyannote.txt
```

In `server/.env`:

```
PYANNOTE_LOCAL=1
PYTHON_BIN=python
PYANNOTE_NUM_SPEAKERS=2
```

The backend can also run local TalkNet-ASD inference on short camera clips and return an active-speaker face hint.
Setup:

```
cd server
git clone https://github.com/TaoRuijie/TalkNet-ASD.git
# then install TalkNet dependencies inside your Python env
```

In `server/.env`:

```
TALKNET_LOCAL=1
TALKNET_REPO_PATH=./TalkNet-ASD
TALKNET_PYTHON_BIN=python
```

- Open Chrome at the frontend URL (e.g. `http://localhost:3000`).
- Optionally choose "Translate to" to show bubbles in another language.
- Click "Start camera & mic" and allow camera and microphone access. If the backend is up and has an API key, the status will say "Speech: AI (auto language)". Speak in any language; non-English speech is auto-detected and shown in English (and in the "Translate to" language if one is set). "Detected" shows the detected language.
- Chrome (recommended): full support for Web Speech API and camera/mic.
- HTTPS or localhost: required for `getUserMedia` and Speech Recognition.
- Internet connection: Chrome’s speech recognition sends audio to Google’s servers. If you see “Speech error: network”, check your connection and that firewalls/proxies aren’t blocking Google.
- Camera and microphone.
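The "HTTPS or localhost" requirement exists because browsers restrict `getUserMedia` to secure contexts. A quick pre-flight check can fail fast with a clear message; the helper name below is mine, and in a real page you can simply consult `window.isSecureContext`:

```javascript
// Sketch: will getUserMedia be allowed at this origin?
// Browsers require a secure context: HTTPS, or localhost during development.
// Hypothetical helper; production code can just check window.isSecureContext.
function isSecureOrigin(protocol, hostname) {
  return (
    protocol === "https:" ||
    hostname === "localhost" ||
    hostname === "127.0.0.1" ||
    hostname === "[::1]"
  );
}
```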
| Feature | Implementation |
|---|---|
| Video | `getUserMedia` → `<video>` |
| Face detection | face-api.js (TinyFaceDetector) |
| Voice detection | Volume-based VAD (`AnalyserNode`) |
| Speech-to-text | Web Speech API (`SpeechRecognition`) |
| Bubbles | Up to 4 people: one bubble per face; active speaker from mouth openness (face landmarks); each person’s dialogue in their own bubble |
| Language | Auto: voice is sent to the AI backend; the language is detected and non-English is translated to English. Translate to: optional extra translation into another language. |
| Backend API | `server/`: `POST /api/transcribe` (audio → Whisper + detect + translate to EN), `POST /api/detect-language`, `POST /api/translate`. Requires `OPENAI_API_KEY` for AI transcription and auto language detection. |
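The "active speaker from mouth openness" behavior can be sketched as a ratio computed from face landmarks: the vertical lip gap divided by the mouth width, compared across faces. The landmark shape and the 0.05 threshold below are illustrative assumptions, not the exact values used in app.js:

```javascript
// Sketch: estimate mouth openness per face and pick the active speaker.
// Landmark points are { x, y }; the width normalization and 0.05 threshold
// are illustrative assumptions, not app.js's actual tuning.
function mouthOpenness(upperLip, lowerLip, leftCorner, rightCorner) {
  const gap = Math.abs(lowerLip.y - upperLip.y);
  const width = Math.abs(rightCorner.x - leftCorner.x);
  return width > 0 ? gap / width : 0; // normalize by width → scale-invariant
}

// faces: [{ id, upperLip, lowerLip, leftCorner, rightCorner }]
function activeSpeaker(faces, threshold = 0.05) {
  let best = null;
  let bestOpen = threshold; // nobody is "speaking" below the threshold
  for (const f of faces) {
    const open = mouthOpenness(f.upperLip, f.lowerLip, f.leftCorner, f.rightCorner);
    if (open > bestOpen) {
      bestOpen = open;
      best = f.id;
    }
  }
  return best; // null when every mouth is effectively closed
}
```

Normalizing by mouth width keeps the measure stable as a face moves closer to or farther from the camera.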
```
NPC/
├── PROMPT.md        # Product prompt
├── ROADMAP.md       # Development roadmap
├── README.md        # This file
├── index.html       # App shell
├── styles.css       # Layout and bubble styling
├── app.js           # Camera, face detection, VAD, STT, bubbles, language UI
└── server/          # Optional backend for language detection + translation
    ├── package.json
    ├── server.js
    └── .env.example
```
- Phase 2: Better bubble styling, multiple faces, errors.
- Phase 3: Multi-speaker (who said what).
- Phase 4: Deploy, optional cloud STT.