Inspiration
Approximately 430 million people around the world are deaf or hard of hearing, and that number is expected to grow by 63% to 700 million by 2050. Individuals who are d/Deaf or hard of hearing (DHH) report group conversations as the most frequent context in which they encounter difficulties [1, 2]. This is commonly called "dinner table syndrome": because topics change too quickly to follow, DHH individuals are often left out of conversations and end up feeling ignored and disengaged from those around them.
Georgia Tech professor Thad Starner explored this topic in his research comparing captioned and uncaptioned group conversations involving deaf individuals ("Preferences for Captioning on Emulated Head Worn Displays While in Group Conversation"). The study found that captions significantly improved these conversations for deaf participants, but the captions were hard-coded beforehand for comparison purposes. We wanted to make them work live.
What it does
Graption shows captions underneath the face of the person who is speaking, so a deaf user can read the text while also seeing facial expressions. Because tone of voice is not always apparent from a face, we also display the speaker's tone, along with the volume they are speaking at, which makes it easy to tell who is shouting or whispering and helps the user feel involved. If anyone mentions the user, a pop-up makes it obvious that the group is addressing them. After the meeting, the user has access to our analytics dashboard, which lists their previous meetings. For each meeting they can review the transcript, each person's participation, tone, and volume, the highlights, the tasks assigned to different people, and an AI-generated summary. The dashboard also surfaces past meetings that discussed the same topics as the one currently being viewed, along with a summary of the key details from those meetings.
How we built it
Graption uses Apple’s built-in ARKit/Vision face tracking to detect lip movements and identify who is speaking. For speech-to-text, it streams ElevenLabs transcription responses to the phone in chunks over WebSockets. For tonal analysis, we fine-tuned Wav2Vec on the RAVDESS dataset (https://zenodo.org/records/1188976), which substantially improved accuracy. We compute volume with NumPy as rms = np.sqrt(np.mean(samples**2)), then normalize it with min(1.0, rms * 5.0). All past meetings for the dashboard are stored in Snowflake, and Snowflake Cortex AI generates each meeting's summary. We also use RAG with Snowflake's vector database to retrieve similar past meetings, then use Cortex AI to extract the key details from them.
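The volume computation above can be sketched as a small helper; the formulas are the ones from our pipeline, and the 5.0 gain is just the tuning constant we settled on:

```python
import numpy as np

def estimate_volume(samples: np.ndarray) -> float:
    """Map a chunk of audio samples to a 0..1 volume level via RMS."""
    # Root-mean-square amplitude of the chunk
    rms = np.sqrt(np.mean(samples ** 2))
    # Scale and clamp to [0, 1]; 5.0 is an empirically chosen gain
    return min(1.0, rms * 5.0)
```

Silence maps to 0.0 and anything at or above one fifth of full scale saturates at 1.0, which worked well for distinguishing whispering from shouting in practice.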
Challenges we ran into
One of the biggest challenges was that the off-the-shelf Wav2Vec model gave poor results: it predicted Neutral almost every time. We had to fine-tune it, which meant first finding a suitable dataset and then running the fine-tuning itself, all of which took a lot of time.
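Building the fine-tuning label set from RAVDESS is mostly filename parsing: each file name is seven dash-separated fields, and the third field is the emotion code. A minimal sketch (field layout per the RAVDESS documentation; the helper name is our own):

```python
# RAVDESS filenames look like "03-01-06-01-02-01-12.wav";
# the third field encodes the emotion (01=neutral ... 08=surprised).
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path: str) -> str:
    """Extract the emotion label from a RAVDESS file path."""
    stem = path.rsplit("/", 1)[-1].removesuffix(".wav")
    fields = stem.split("-")
    return RAVDESS_EMOTIONS[fields[2]]
```

Pairing these labels with the corresponding audio gives the supervised dataset we used to fine-tune Wav2Vec for tone classification.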
Accomplishments that we're proud of
We are proud of integrating so many different components: ARKit/Vision, Snowflake, ElevenLabs, and the fine-tuned Wav2Vec model. Using WebSockets was also extremely hard, since we had no prior experience with them.
What we learned
We learned how to use WebSockets to stream data between the backend and frontend in both directions. This was incredibly hard, and we are proud of figuring it out. This was also the first time we fine-tuned a model, which we are incredibly proud of as well.
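The client side of the streaming is mostly about accumulating partial-transcript chunks into the caption currently shown under the speaker's face. A minimal sketch of that logic, assuming a hypothetical message shape with a `text` field and an `is_final` flag marking the end of an utterance:

```python
class CaptionStream:
    """Accumulate partial-transcript chunks arriving over a WebSocket
    into the caption shown under the current speaker's face."""

    def __init__(self):
        self.current = ""    # caption being built for the live utterance
        self.finalized = []  # completed utterances, kept for the transcript

    def on_chunk(self, chunk: dict) -> None:
        # A final chunk closes out the utterance; otherwise keep appending.
        if chunk.get("is_final"):
            self.finalized.append(self.current + chunk["text"])
            self.current = ""
        else:
            self.current += chunk["text"]
```

With this split, the UI can redraw the live caption on every chunk while the finalized list feeds the post-meeting transcript.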
What's next for Graption
Next for Graption, we’re focused on making it even more powerful and practical. We want to integrate individual microphones to improve noise reduction and speaker diarization, so captions are cleaner and more accurate in real-world settings. We’re also planning to work closely with Thad Starner to refine the system and explore further research-backed improvements. Finally, we aim to extend Graption to wearable glasses, using handheld GPUs to run the AI models, bringing real-time accessibility and augmented captioning beyond the phone and into everyday life.