Inspiration

Our primary motivation was improving accessibility for the hearing-impaired. One of the biggest problems they face is socialization: many report feeling completely isolated, since there is no practical channel of communication between people who are fully deaf and people who don't speak sign language. The most natural solution that came to mind was live captioning in a conversational context with multiple speakers. Our initial inspiration was the sci-fi idea of overhead subtitles for every conversation in real life, with self-translating captions if you're speaking with someone in a different language!

Delving into the academic literature of the field, we discovered the cocktail party problem, a classic problem at the intersection of physics, mathematics, and CS: given mixed audio from multiple sources (like a party), isolate the voice of each speaker. There has been considerable progress towards solving it, in both academia and applications, but mostly via one approach that is nearing its limit: spectral decomposition of audio. It exploits the fact that different voices, music, and other audio sources have fundamentally different physical characteristics (such as amplitude and frequency) that can be used to separate them. However, that is a far cry from the true spatial isolation our vision requires, so we set out to create a practical method to spatially isolate audio sources and then caption each source in real time!

What it does

The implementation consists of two parts: an Android app and a spatially aware robot mounted on an RC car. The robot carries two microphones that spatially locate the source of audio, i.e., where a particular speaker is relative to it. A phone running the app is mounted on the car. Once the robot localizes the speaker, it rotates the phone towards them; the app then listens to their speech and transcribes it into text captions positioned correctly relative to the speaker.

How we built it

The app was built in Android Studio and uses Google Cloud for speech-to-text transcription. The robot uses an Arduino, two microphones, and supporting circuitry: it compares the audio arriving at the two microphones, in particular the difference in arrival time between them, to estimate the speaker's position and rotate the phone towards them.
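The core of the two-microphone localization is time difference of arrival (TDOA): the same sound reaches the nearer microphone slightly earlier, and the delay determines the bearing. Below is a minimal pure-Python sketch of that idea, not our Arduino firmware; the microphone spacing and sample rate are illustrative assumptions, and a real implementation would work on streaming ADC samples in C.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature
MIC_SPACING = 0.2       # assumed distance between the two microphones (m)
SAMPLE_RATE = 40_000    # assumed (overclocked) sampling rate (Hz)

def best_lag(left, right, max_lag):
    """Return the sample lag that maximizes cross-correlation of two signals."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(left[i] * right[i + lag]
                    for i in range(len(left))
                    if 0 <= i + lag < len(right))
        if score > best_score:
            best, best_score = lag, score
    return best

def bearing_degrees(left, right):
    """Estimate the speaker's bearing from the inter-microphone delay."""
    # A delay can never exceed the time sound takes to cross the mic baseline.
    max_lag = int(MIC_SPACING / SPEED_OF_SOUND * SAMPLE_RATE) + 1
    lag = best_lag(left, right, max_lag)
    tau = lag / SAMPLE_RATE  # time difference of arrival (s)
    # Far-field approximation: sin(theta) = c * tau / d.
    # Clamp to [-1, 1] so asin is defined despite sampling error.
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / MIC_SPACING))
    return math.degrees(math.asin(ratio))
```

For example, feeding in two recordings where the second microphone's pulse arrives 5 samples later yields a bearing of roughly 12 degrees off-axis towards the first microphone.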

Challenges we ran into

Our biggest initial hurdle was the sea of academic literature to sort through. After catching up on modern research, we set out to implement a model theorized just 20 days earlier (ViVoT): using a deep learning model to build an AR phone app that performs spatial audio isolation with the camera and phone microphones and overlays live captions above each speaker. Halfway through implementing it, however, we realized that the paper didn't provide all the details necessary to reproduce the model, and on emailing its author we got confirmation that it would only be fully released after the paper was accepted by the journal and published. We then pivoted to our current robot + app, a mechanical/ECE approach to spatial isolation in the spirit of older papers we had read, instead of a purely machine learning one. While building the robot, we found that the Arduino's default sampling rate is too low to localize the source of a sound with two microphones; we ultimately had to overclock it to make it work!
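Why does the sampling rate matter so much? The delay between the two microphones can never exceed the time sound takes to cross the baseline between them, and the Arduino can only measure that delay in whole-sample steps. The back-of-the-envelope arithmetic below (with an assumed 0.2 m microphone spacing, and illustrative stock vs. overclocked rates) shows how few distinguishable bearings a slow ADC leaves you with:

```python
SPEED_OF_SOUND = 343.0  # m/s
MIC_SPACING = 0.2       # assumed spacing between the microphones (m)

def distinct_bearings(sample_rate_hz):
    """Number of distinguishable TDOA lag steps across the mic baseline."""
    max_delay = MIC_SPACING / SPEED_OF_SOUND  # ~583 microseconds end to end
    return 2 * int(max_delay * sample_rate_hz) + 1

print(distinct_bearings(10_000))  # → 11 (a ~10 kHz ADC: only 11 coarse bearings)
print(distinct_bearings(75_000))  # → 87 (overclocked ADC: much finer resolution)
```

This is the intuition behind the overclocking fix: every extra sample per second subdivides that ~583 µs window further and sharpens the angle estimate.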

Accomplishments that we're proud of

Natively integrating Google Cloud's RNN-based speech-to-text model into the Android app for transcription and captioning was a challenging exercise in combining multiple tech stacks. Debugging the Arduino and microphones was painstaking, but getting them working was ultimately a happy achievement.

What's next for Spatial Source Speech Captioning

We are still in touch with Dr. Montesinos about applying his deep learning model in an AR app once the paper is fully published. The scope of spatial audio isolation for captioning is limitless: hearing aid systems and speech enhancement, military and industrial noise control, noise pollution monitoring, audio surveillance and acoustic signal processing, automatic music transcription, and of course AR glasses for live overhead subtitle captioning are all known and developed use cases for this technology that we are excited to delve into further in the future!

Check out our GitHub for the source code, and check the video link for a tech demo!
