BehtomAdeli/HearHere

Live Speech Bubbles – MVP

Versioning

  • v0.1: baseline single-transcript behavior.
  • v0.2 (current): per-person transcript isolation, per-person bubble colors, eye-safe bubble placement, and speaker assignment that combines a voiceprint (F1/F2/F3-style spectral peaks) with lip movement as a secondary check.
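
The v0.2 speaker-assignment idea can be sketched as a weighted score per person. This is an illustrative sketch only, not the code in app.js: the field names (`voiceprintScore`, `lipScore`) and the 0.7/0.3 weighting are assumptions chosen to show voiceprint as primary and lip movement as secondary.

```javascript
// Hypothetical sketch: combine a voiceprint match score (primary) with a
// lip-movement score (secondary) to pick the speaker for an audio chunk.
// Weights and field names are illustrative, not taken from app.js.
function assignSpeaker(people, { voiceWeight = 0.7, lipWeight = 0.3 } = {}) {
  let best = null;
  for (const person of people) {
    // voiceprintScore: similarity of the chunk's spectral peaks (F1/F2/F3)
    // to this person's stored voiceprint; lipScore: recent mouth movement.
    const score = voiceWeight * person.voiceprintScore + lipWeight * person.lipScore;
    if (!best || score > best.score) best = { id: person.id, score };
  }
  return best; // { id, score } of the most likely speaker, or null
}
```

Weighting the voiceprint higher keeps attribution stable when someone moves their lips without speaking (e.g. smiling or chewing), while the lip term breaks ties between similar voices.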

Camera + face detection + voice detection + speech-to-text, with transcripts shown in a bubble above your head.

Prompt & roadmap

  • Product prompt: PROMPT.md – what the app should do.
  • Roadmap: ROADMAP.md – phases and MVP steps.

Run the MVP

  1. Frontend – use a local server (required for camera/mic and face-api models):
    npx serve .
    Or: python -m http.server 8080
  2. Backend (required for Whisper transcription + AI language detection/translation):
    cd server
    npm install
    # Put your key in server/.env:
    # OPENAI_API_KEY=sk-...
    # Optional for speaker diarization (pyannote):
    # HF_TOKEN=hf_...
    # (or use environment variable)
    set OPENAI_API_KEY=sk-...   # Windows
    set HF_TOKEN=hf_...         # Windows (optional)
    # export OPENAI_API_KEY=sk-...   # Mac/Linux
    # export HF_TOKEN=hf_...         # Mac/Linux (optional)
    npm start
    The API runs on http://localhost:3001. With OPENAI_API_KEY set, the app sends audio chunks to the backend, which transcribes them with Whisper, detects the language, and, if it is not English, returns an English translation.
    If HF_TOKEN is set, the backend also calls Hugging Face diarization and returns a speaker-label hint to improve two-person separation.
    The default primary model endpoint is pyannote/speaker-diarization-3.1, with hicustomer/pyannote-speaker-diarization as a fallback (configurable via HF_DIARIZATION_URL and HF_DIARIZATION_FALLBACK_URL).
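
The per-chunk backend flow (transcribe, detect, translate only when non-English) can be sketched as below. This is not the actual server.js code: the `transcribe`, `detectLanguage`, and `translate` functions are injected stubs standing in for the real Whisper and translation calls.

```javascript
// Illustrative sketch of the backend's per-chunk flow: transcribe the audio,
// detect the language, and translate to English only when the detected
// language is not English. All three dependencies are injected stubs here.
async function processChunk(audio, { transcribe, detectLanguage, translate }) {
  const text = await transcribe(audio);        // Whisper in the real backend
  const detected = await detectLanguage(text); // e.g. 'en', 'fr'
  if (detected === 'en') return { text, detected };
  const english = await translate(text, 'en');
  return { text, detected, english };          // original plus English version
}
```

Skipping translation for English chunks avoids a needless API round trip on the most common path.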

Local pyannote mode (recommended for your use case)

This uses the same flow as:

  • Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=...)
  • diarization = pipeline("audio.wav", num_speakers=2)

Setup:

cd server
pip install -r requirements-pyannote.txt

In server/.env:

PYANNOTE_LOCAL=1
PYTHON_BIN=python
PYANNOTE_NUM_SPEAKERS=2
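
With these variables set, the Node backend can shell out to a local Python diarization script. A minimal sketch of how the command might be assembled from the .env values above; the script name (diarize.py) and the --num-speakers flag are assumptions, not the backend's actual interface.

```javascript
// Hedged sketch: assemble the command a Node backend might run for local
// pyannote diarization, from the .env values above. The script name
// (diarize.py) and flag names are assumptions for illustration.
function buildPyannoteCommand(env, wavPath) {
  if (env.PYANNOTE_LOCAL !== '1') return null; // local mode disabled
  const python = env.PYTHON_BIN || 'python';
  const args = ['diarize.py', wavPath];
  if (env.PYANNOTE_NUM_SPEAKERS) {
    // mirrors pipeline("audio.wav", num_speakers=2) in the pyannote flow
    args.push('--num-speakers', env.PYANNOTE_NUM_SPEAKERS);
  }
  return { python, args }; // pass to child_process.spawn(python, args)
}
```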

TalkNet-ASD mode (audio-visual active speaker)

The backend can run local TalkNet-ASD inference on short camera clips and return an active-speaker face hint.

Setup:

cd server
git clone https://github.com/TaoRuijie/TalkNet-ASD.git
# then install TalkNet dependencies inside your python env

In server/.env:

TALKNET_LOCAL=1
TALKNET_REPO_PATH=./TalkNet-ASD
TALKNET_PYTHON_BIN=python

Use the app

  1. Open Chrome at the frontend URL (e.g. http://localhost:3000).
  2. Optionally choose Translate to to show bubbles in another language.
  3. Click Start camera & mic and allow camera and microphone access. If the backend is up and has an API key, the status will say “Speech: AI (auto language)”. Speak in any language; non-English speech is auto-detected and shown in English (and in the “Translate to” language if set). The Detected field shows the detected language.

Requirements

  • Chrome (recommended): full support for Web Speech API and camera/mic.
  • HTTPS or localhost: required for getUserMedia and Speech Recognition.
  • Internet connection: Chrome’s speech recognition sends audio to Google’s servers. If you see “Speech error: network”, check your connection and that firewalls/proxies aren’t blocking Google.
  • Camera and microphone.

MVP features

| Feature | Implementation |
| --- | --- |
| Video | getUserMedia + `<video>` |
| Face detection | face-api.js (TinyFaceDetector) |
| Voice detection | Volume-based VAD (AnalyserNode) |
| Speech-to-text | Web Speech API (SpeechRecognition) |
| Bubbles | Up to 4 people: one bubble per face; active speaker from mouth openness (face landmarks); each person’s dialogue in their own bubble |
| Language | Auto: voice is sent to the AI backend; language is detected and non-English is translated to English. Translate to: optional extra translation to another language. |
| Backend API | server/: POST /api/transcribe (audio → Whisper + detect + translate to EN), POST /api/detect-language, POST /api/translate. Requires OPENAI_API_KEY for AI transcription and auto language. |
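
The volume-based VAD can be sketched as an RMS check over time-domain samples. In the browser this would be fed by `analyser.getByteTimeDomainData(buf)` on each frame; the 0.02 threshold here is an illustrative value, not the one used in app.js.

```javascript
// Sketch of the volume-based VAD idea: read time-domain bytes from a Web
// Audio AnalyserNode and treat the frame as speech when the RMS volume
// clears a threshold. The threshold value is illustrative.
function isSpeaking(timeDomainBytes, threshold = 0.02) {
  let sumSquares = 0;
  for (const byte of timeDomainBytes) {
    const sample = (byte - 128) / 128; // bytes are centered on 128 (silence)
    sumSquares += sample * sample;
  }
  const rms = Math.sqrt(sumSquares / timeDomainBytes.length);
  return rms > threshold;
}
```

RMS over a whole analyser frame smooths out single-sample spikes, which is why it is preferred over peak amplitude for a simple VAD.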

Project layout

NPC/
├── PROMPT.md    # Product prompt
├── ROADMAP.md   # Development roadmap
├── README.md    # This file
├── index.html   # App shell
├── styles.css   # Layout and bubble styling
├── app.js       # Camera, face detection, VAD, STT, bubbles, language UI
└── server/      # Optional backend for language detection + translation
    ├── package.json
    ├── server.js
    └── .env.example

Next steps (see ROADMAP.md)

  • Phase 2: Better bubble styling, multiple faces, errors.
  • Phase 3: Multi-speaker (who said what).
  • Phase 4: Deploy, optional cloud STT.

About

HearHere speech bubbles app
