v0.1: baseline single-transcript behavior.

v0.2 (current): per-person transcript isolation, per-person bubble colors, eye-safe bubble placement, and speaker assignment that combines voiceprint (F1/F2/F3-style spectral peaks) with lip movement as a secondary check.
Camera + face detection + voice detection + speech-to-text, with transcripts shown in a bubble above your head.
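The v0.2 speaker-assignment rule can be sketched as a weighted score: voiceprint distance over F1/F2/F3-style spectral peaks is the primary signal, with lip movement as a secondary check. The weights, normalization, and `assignSpeaker` helper below are illustrative assumptions, not the app's actual implementation:

```javascript
// Sketch: combine voiceprint similarity (primary) with lip movement (secondary).
// Weights and data shapes are illustrative assumptions, not the app's real code.

// Euclidean distance between two formant vectors [F1, F2, F3] in Hz.
function formantDistance(a, b) {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

// people: [{ id, formants: [F1, F2, F3], lipMovement: 0..1 }]
// observedFormants: formant peaks measured from the current audio chunk.
function assignSpeaker(people, observedFormants) {
  let best = null;
  let bestScore = -Infinity;
  for (const p of people) {
    // Closer formants → higher score; scale distance into roughly 0..1.
    const voiceScore = 1 / (1 + formantDistance(p.formants, observedFormants) / 100);
    // Voiceprint dominates; lip movement only breaks close calls.
    const score = 0.7 * voiceScore + 0.3 * p.lipMovement;
    if (score > bestScore) {
      bestScore = score;
      best = p.id;
    }
  }
  return best;
}
```

Weighting the voiceprint heavily means a silent face with residual lip motion cannot steal a line from a strong spectral match.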
- Product prompt: PROMPT.md – what the app should do.
- Roadmap: ROADMAP.md – phases and MVP steps.
- Frontend – use a local server (required for camera/mic and face-api models):

  ```
  npx serve .
  ```

  Or:

  ```
  python -m http.server 8080
  ```

- Backend (required for Whisper transcription + AI language detection/translation):
  ```
  cd server
  npm install
  # Put your key in server/.env:
  # OPENAI_API_KEY=sk-...
  # Optional for speaker diarization (pyannote):
  # HF_TOKEN=hf_...
  # (or use environment variables)
  set OPENAI_API_KEY=sk-...        # Windows
  set HF_TOKEN=hf_...              # Windows (optional)
  # export OPENAI_API_KEY=sk-...   # Mac/Linux
  # export HF_TOKEN=hf_...         # Mac/Linux (optional)
  npm start
  ```

The API runs on `http://localhost:3001`. With `OPENAI_API_KEY` set, the app sends audio chunks to the backend: it transcribes with Whisper, detects the language, and if the speech is not English returns an English translation.
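The "detect, then translate only if not English" decision can be illustrated with a small helper. This is a hedged sketch of the described flow; the function name and the actual logic in server.js may differ:

```javascript
// Sketch: decide whether a detected language needs translation to English.
// Illustrative helper mirroring the described flow, not the real server.js code.
function needsTranslation(detectedLang, targetLang = "en") {
  if (!detectedLang) return false; // unknown language → leave the text as-is
  // Compare base language codes, so "en-US" still counts as English.
  return detectedLang.toLowerCase().split("-")[0] !== targetLang;
}
```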
If `HF_TOKEN` is set, the backend also calls Hugging Face diarization and returns a speaker-label hint to improve two-person separation.
The default primary model endpoint is `pyannote/speaker-diarization-3.1` and the fallback is `hicustomer/pyannote-speaker-diarization` (configurable with `HF_DIARIZATION_URL` and `HF_DIARIZATION_FALLBACK_URL`).
This uses the same flow as:

```python
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=...)
diarization = pipeline("audio.wav", num_speakers=2)
```
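One way to turn diarization output into the speaker-label hint is to pick whoever talks most within the audio chunk. A sketch in JavaScript: the segment shape `{ start, end, speaker }` follows pyannote's speaker turns conceptually, but the helper and its wiring are assumptions, not the backend's actual code:

```javascript
// Sketch: pick the dominant speaker among diarization segments overlapping a chunk.
// Segment shape { start, end, speaker } is modeled on pyannote's speaker turns;
// this helper is illustrative, not the backend's real implementation.
function dominantSpeaker(segments, chunkStart, chunkEnd) {
  const talkTime = {};
  for (const { start, end, speaker } of segments) {
    // Overlap of [start, end] with [chunkStart, chunkEnd], in seconds.
    const overlap = Math.min(end, chunkEnd) - Math.max(start, chunkStart);
    if (overlap > 0) talkTime[speaker] = (talkTime[speaker] || 0) + overlap;
  }
  let best = null;
  for (const [speaker, t] of Object.entries(talkTime)) {
    if (best === null || t > talkTime[best]) best = speaker;
  }
  return best; // null if no segment overlaps the chunk
}
```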
Setup:

```
cd server
pip install -r requirements-pyannote.txt
```

In `server/.env`:

```
PYANNOTE_LOCAL=1
PYTHON_BIN=python
PYANNOTE_NUM_SPEAKERS=2
```

The backend can also run local TalkNet-ASD inference on short camera clips and return an active-speaker face hint.
Setup:

```
cd server
git clone https://github.com/TaoRuijie/TalkNet-ASD.git
# then install TalkNet dependencies inside your Python env
```

In `server/.env`:

```
TALKNET_LOCAL=1
TALKNET_REPO_PATH=./TalkNet-ASD
TALKNET_PYTHON_BIN=python
```

- Open Chrome at the frontend URL (e.g. `http://localhost:3000`).
- Optionally choose "Translate to" to show bubbles in another language.
- Click "Start camera & mic" and allow camera and microphone access. If the backend is up and has an API key, the status will say "Speech: AI (auto language)". Speak in any language; non-English speech is auto-detected and shown in English (and in the "Translate to" language if one is set). "Detected" shows the detected language.
- Chrome (recommended): full support for Web Speech API and camera/mic.
- HTTPS or localhost: required for `getUserMedia` and Speech Recognition.
- Internet connection: Chrome’s speech recognition sends audio to Google’s servers. If you see “Speech error: network”, check your connection and that firewalls/proxies aren’t blocking Google.
- Camera and microphone.
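The "HTTPS or localhost" requirement exists because browsers restrict `getUserMedia` to secure contexts. A quick pre-flight check can fail fast with a clear message; the helper name below is mine, and in a real page you can simply consult `window.isSecureContext`:

```javascript
// Sketch: will getUserMedia be allowed at this origin?
// Browsers require a secure context: HTTPS, or localhost during development.
// Hypothetical helper; production code can just check window.isSecureContext.
function isSecureOrigin(protocol, hostname) {
  return (
    protocol === "https:" ||
    hostname === "localhost" ||
    hostname === "127.0.0.1" ||
    hostname === "[::1]"
  );
}
```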
| Feature | Implementation |
|---|---|
| Video | `getUserMedia` → `<video>` |
| Face detection | face-api.js (TinyFaceDetector) |
| Voice detection | Volume-based VAD (`AnalyserNode`) |
| Speech-to-text | Web Speech API (`SpeechRecognition`) |
| Bubbles | Up to 4 people: one bubble per face; active speaker from mouth openness (face landmarks); each person’s dialogue in their own bubble |
| Language | Auto: voice is sent to the AI backend; the language is detected and non-English is translated to English. Translate to: optional extra translation into another language. |
| Backend API | `server/`: `POST /api/transcribe` (audio → Whisper + detect + translate to EN), `POST /api/detect-language`, `POST /api/translate`. Requires `OPENAI_API_KEY` for AI transcription and auto language detection. |
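The "active speaker from mouth openness" behavior can be sketched as a ratio computed from face landmarks: the vertical lip gap divided by the mouth width, compared across faces. The landmark shape and the 0.05 threshold below are illustrative assumptions, not the exact values used in app.js:

```javascript
// Sketch: estimate mouth openness per face and pick the active speaker.
// Landmark points are { x, y }; the width normalization and 0.05 threshold
// are illustrative assumptions, not app.js's actual tuning.
function mouthOpenness(upperLip, lowerLip, leftCorner, rightCorner) {
  const gap = Math.abs(lowerLip.y - upperLip.y);
  const width = Math.abs(rightCorner.x - leftCorner.x);
  return width > 0 ? gap / width : 0; // normalize by width → scale-invariant
}

// faces: [{ id, upperLip, lowerLip, leftCorner, rightCorner }]
function activeSpeaker(faces, threshold = 0.05) {
  let best = null;
  let bestOpen = threshold; // nobody is "speaking" below the threshold
  for (const f of faces) {
    const open = mouthOpenness(f.upperLip, f.lowerLip, f.leftCorner, f.rightCorner);
    if (open > bestOpen) {
      bestOpen = open;
      best = f.id;
    }
  }
  return best; // null when every mouth is effectively closed
}
```

Normalizing by mouth width keeps the measure stable as a face moves closer to or farther from the camera.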
```
NPC/
├── PROMPT.md        # Product prompt
├── ROADMAP.md       # Development roadmap
├── README.md        # This file
├── index.html       # App shell
├── styles.css       # Layout and bubble styling
├── app.js           # Camera, face detection, VAD, STT, bubbles, language UI
└── server/          # Optional backend for language detection + translation
    ├── package.json
    ├── server.js
    └── .env.example
```
- Phase 2: Better bubble styling, multiple faces, errors.
- Phase 3: Multi-speaker (who said what).
- Phase 4: Deploy, optional cloud STT.