Inspiration
The inspiration for HearHere came from conversations with members of the Deaf and Hard-of-Hearing community. Many described ongoing difficulties with movies, TV shows, and group conversations despite the availability of closed captions. While captions display spoken words and sound effects, they often lack directionality, nuance, and clarity—especially when multiple people speak at once or sounds overlap.
This highlighted a fundamental issue: captions provide what is said, but not who is saying it or where the sound originates. From this insight, HearHere was born—a system that makes captions spatially informative by attaching spoken words directly to the speaker through on-screen speech bubbles.
As development progressed, the project expanded and contracted in scope. While the initial idea was broad, we ultimately focused on solving a core challenge well: identifying individual speakers, tracking them over time, and visually displaying their speech in a clear, intuitive way.
What it does
HearHere is a real-time, speaker-aware closed captioning system for audio-visual media. Using a webcam and microphone, it detects and tracks faces, transcribes speech, and displays each person’s words in a speech bubble anchored above their face. This restores the spatial context of conversation, making it easier to follow who is speaking—especially in multi-speaker scenarios.
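The bubble placement described above can be reduced to simple geometry on a face's bounding box. The sketch below is illustrative only, not HearHere's actual code; all names and parameters are assumptions:

```python
def bubble_anchor(face_box, frame_w, frame_h, bubble_w=200, bubble_h=60, margin=10):
    """Place a speech bubble centered above a face bounding box.

    face_box is (x, y, w, h) in pixels; the result is clamped so the
    bubble stays fully inside the frame.
    """
    x, y, w, h = face_box
    bx = x + w // 2 - bubble_w // 2           # center horizontally on the face
    by = y - bubble_h - margin                # sit just above the face
    bx = max(0, min(bx, frame_w - bubble_w))  # clamp to frame edges
    by = max(0, min(by, frame_h - bubble_h))
    return bx, by
```

Clamping matters in practice: when a face is near the top of the frame, the bubble slides down to the frame edge rather than disappearing off-screen.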
In future versions, these speech bubbles could also encode emotional or tonal information to further enrich communication.
How we built it
Using Cursor, we implemented:

- Active Speaker Detection in Python, via two approaches: pyannote.audio for speaker diarization, and a pretrained TalkNet deep neural network (PyTorch) for active speaker detection (ASD)
- FaceAPI, an AI-powered face detection and rotation-tracking library for Node.js built on TensorFlow.js
- Browser APIs (getUserMedia, MediaRecorder, and the Web Audio API) to capture the needed audio and video and present the AR overlay

The final version runs locally via CUDA, with the Hugging Face API also implemented as a fallback.
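The glue between these components can be sketched as interval matching: each transcribed segment is assigned to the diarized speaker whose turn overlaps it most in time. This is a minimal illustration of the idea, not HearHere's actual implementation; the data shapes are assumed:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute_segments(transcript, turns):
    """Assign each transcript segment to the speaker whose turn overlaps it most.

    transcript: list of (start, end, text) from a speech recognizer;
    turns: list of (start, end, speaker) from a diarization pipeline.
    Segments with no overlapping turn are labeled "unknown".
    """
    captions = []
    for seg_start, seg_end, text in transcript:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            ov = overlap(seg_start, seg_end, turn_start, turn_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        captions.append((best_speaker, text))
    return captions
```

Once a segment carries a speaker label, rendering it as a bubble is just a lookup of that speaker's current face track.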
After developing a working version by hand, we wanted to see how well we could build it with AI coding assistants. This led us to using OpenAI's Codex, Google's Gemini, and Anthropic's Claude models simultaneously to draft, design, and build a working version of our project.
Challenges we ran into
Robustly detecting the active speaker and showing the bubble in the right location was challenging. One particular issue is that two members of our team have low voices and beards, making lip tracking inconsistent and voice recognition difficult.
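One way to mitigate a weak cue (obscured lips, a hard-to-match voice) is to fuse both cues rather than trust either alone, scoring each face by a weighted combination of lip-motion and voice-match evidence. The sketch below is an illustrative assumption, not our production logic; the weights and score names are hypothetical:

```python
def fused_speaker_score(lip_motion, voice_match, w_visual=0.6, w_audio=0.4):
    """Combine a visual lip-motion score and an audio voice-match score.

    Both inputs are expected in [0, 1]; the weighted sum favors the visual
    cue but lets strong audio evidence compensate when, e.g., a beard
    makes lip tracking unreliable.
    """
    return w_visual * lip_motion + w_audio * voice_match

def pick_active_speaker(scores):
    """scores: dict mapping face id -> (lip_motion, voice_match) tuple."""
    return max(scores, key=lambda fid: fused_speaker_score(*scores[fid]))
```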
Accomplishments that we're proud of
We are proud that the system now works and can serve as a first step toward many future iterations. It demonstrates the applicability of such a device, especially if paired with an in-glasses display.
What we learned
This was our first foray into using generative AI for coding. We used several models at different stages of the development cycle, which taught us the specific strengths and weaknesses of each. We initially wrote the code by hand; after developing a working version, we decided to test how a set of generative AI models (Codex, Gemini, and Claude) working in concert would perform at replicating the project from scratch. While building the project by hand and building it with the models took roughly the same amount of time, using the AI assistants drastically decreased the mental load and increased the level of polish in the project. (Both the hand-made version and the AI-made version are attached in the Try It Out links.)
In addition to learning how to operate the AI models, we also learned more about collaborative development workflows. Our code had always been intended to run on one, maybe two computers, but this hackathon taught us to use GitHub, track changes, and keep dev logs to ensure collaboration between multiple groups without verbal instructions.
What's next for HearHere
Next steps for HearHere include improving speaker attribution accuracy, expanding support for overlapping speech, and refining the user experience. Longer-term goals include emotion encoding, multilingual translation, and extending the system to augmented reality environments, where spatially grounded captions could become a natural part of everyday communication.
Built With
- css
- faceapi
- faster-whisper
- html
- huggingface
- javascript
- python
- pytorch
- talknet
- tensorflow