Inspiration
After a truly powerful keynote presentation from Swetha Machanavajhala, we decided to focus on building an application for the hard of hearing. During the keynote, Swetha said that typical speech-to-text systems take away from conversations because they force people to read the words rather than focus on the conversation. We made an app to solve that problem.
What it does
SeenAndHeard transcribes live conversations into text and, using augmented reality, displays that text conveniently beside the respective speaker. This lets the user focus on the person's face and the conversation at the same time.
How we built it
We built the application in Unity with the Vuforia augmented reality framework. Speech is processed with Microsoft Cognitive Services, and the entire app is written in C#.
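Our exact hackathon code isn't shown here, but the speech-processing piece can be sketched with the current Azure Speech SDK for C# (the successor to the Bing Speech service we used). This is a minimal console-style sketch, not our Unity integration; the subscription key and region are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class TranscriptionSketch
{
    static async Task Main()
    {
        // Placeholder credentials -- substitute a real key and service region.
        var config = SpeechConfig.FromSubscription("YOUR_KEY", "YOUR_REGION");
        using var recognizer = new SpeechRecognizer(config);

        // Intermediate hypotheses arrive while the speaker is still talking,
        // which is what would let AR captions update in near real time.
        recognizer.Recognizing += (s, e) =>
            Console.WriteLine($"partial: {e.Result.Text}");

        // Finalized utterances arrive once the recognizer detects a pause.
        recognizer.Recognized += (s, e) =>
            Console.WriteLine($"final: {e.Result.Text}");

        await recognizer.StartContinuousRecognitionAsync();
        Console.ReadLine(); // keep transcribing until the user presses Enter
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```

In a Unity build, code along these lines would run inside a MonoBehaviour, and each recognized string would be pushed into a world-space text element anchored near the speaker via Vuforia tracking.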
Challenges we ran into
Audio processing is very difficult, and it's easy to pick up unwanted background noise. You also have to weigh transcription speed against quality; we found that our app struck a good balance between correctness and speed. In the end, we wish we could have done more to speed up the text processing, but we had only one working laptop and ran into issues getting intermediate results from the Microsoft API.
Accomplishments that we're proud of
It's hard to get an application up and running in just 24 hours, and we feel we put together a responsive application with lasting impact. We were able to link many different technologies into a seamless augmented reality experience. Also, neither of us had used any of these technologies before, so we had an awesome chance to learn about unfamiliar areas.
What we learned
We learned a lot about augmented reality and machine learning. We also learned about image tagging and what it takes for an image to have a good, trackable shape.
What's next for SeenAndHeard
We would love to develop this project into a full-fledged application with a better response rate that can distinguish people by their voices and locations. For example, in a basic speech-to-text application, if two individuals speak at the same time, their words overlap and the text looks like gibberish. SeenAndHeard has the potential to eliminate this issue by giving each speaker a separate stream of text. Also, for a truly seamless experience, the next step would be to port our application to the HoloLens so the user would not need to hold their phone out all the time.
Built With
- azure
- bing-speech
- machine-learning
- unity
- vuforia
