Inspiration

The future of human-computer interaction has long excited the public imagination with visions of humans conversing with their digital assistants as near equals.

Although we have made great strides in tackling the monumental problems of computer speech understanding, I noticed some seemingly small but highly impactful user-experience hiccups that demand attention before we can approach the "live" voice interfaces of science fiction.

A common problem with voice assistants like the Google Home and Amazon Echo is an occasional but annoying irregularity in responding to voice commands.

  • Sometimes, these devices don’t wake up on the first call, requiring repeated calls that run counter to the ease and convenience voice interfaces are supposed to provide.

  • On the other hand, sometimes these devices are too zealous in responding to phrases that sound like their wake word, which can cause them to interrupt conversations, and even pose a risk to user privacy through unwanted listening.

These user-experience hiccups matter so much because, while deep speech understanding works essentially like magic for the user, simple mistakes that no human would make shatter the illusion, reminding us that we're just talking to a piece of electronics.

The problem seems to stem from a heavy reliance on the wake word. While a wake word is convenient to summon a piece of electronics, it’s not entirely ideal when it comes to conversing with our digital assistants. Humans don’t start each sentence with the name of whom they are addressing. Developing a voice UI that can more naturally determine when the user wishes to address it will not only solve the hiccups mentioned before, but it will also move us a step closer towards the true “live” voice interfaces of the future.

What it does

AAVA, or Audience-Aware Voice Assistant, augments traditional voice UIs with computer vision, giving it an idea of the audience present. The idea is to combine visual perception with sound stimulus to judge when the assistant is actually being addressed, and when a phrase is more likely directed at someone else.

Here’s how AAVA works:

AAVA locally measures ambient volume until it hears a sound noticeably louder than the background, which may indicate that it's being addressed. No wake word is required.
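The trigger step can be sketched as a running estimate of background loudness that flags frames rising well above it. This is an illustrative stand-in, not the actual implementation; the smoothing factor and threshold ratio here are made-up parameters:

```python
import math

class AmbientLevelDetector:
    """Tracks ambient loudness with an exponential moving average and
    flags frames that rise well above it. Simplified sketch of a
    wake-word-free trigger; alpha and threshold_ratio are illustrative."""

    def __init__(self, alpha=0.05, threshold_ratio=3.0):
        self.alpha = alpha                    # smoothing for the ambient estimate
        self.threshold_ratio = threshold_ratio
        self.ambient_rms = None               # running background-loudness estimate

    @staticmethod
    def rms(samples):
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    def is_addressed(self, samples):
        level = self.rms(samples)
        if self.ambient_rms is None:
            self.ambient_rms = level          # first frame just seeds the baseline
            return False
        triggered = level > self.threshold_ratio * self.ambient_rms
        if not triggered:
            # adapt the baseline only on quiet frames, so speech
            # doesn't inflate the ambient estimate
            self.ambient_rms = ((1 - self.alpha) * self.ambient_rms
                                + self.alpha * level)
        return triggered
```

Adapting the baseline only on quiet frames keeps a long utterance from being absorbed into the ambient estimate mid-sentence.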

The camera gathers some low-resolution vision data and uses machine vision to count the people present in the room. The vision data is heavily blurred locally before it leaves the device so that no private information is transmitted; all it needs to do is to count human figures.
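The privacy step amounts to destroying detail before anything leaves the device. A minimal sketch of that idea is aggressive mean-pooling, which keeps rough silhouettes countable while discarding identifying features; the real pipeline and block size may differ:

```python
def privacy_downscale(image, block=8):
    """Mean-pool a grayscale image (list of rows of pixel values) into
    coarse block x block cells. Illustrative stand-in for the on-device
    blur: silhouettes survive for person counting, fine detail does not."""
    h, w = len(image), len(image[0])
    out = []
    for by in range(0, h, block):
        row = []
        for bx in range(0, w, block):
            cell = [image[y][x]
                    for y in range(by, min(by + block, h))
                    for x in range(bx, min(bx + block, w))]
            row.append(sum(cell) / len(cell))  # one averaged value per block
        out.append(row)
    return out
```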

If there is just one person in the room, then AAVA guesses that it is being addressed. So it wakes up and starts listening, now responding just like our assistants do today.

If there are multiple people in the room, the behavior is more interesting. AAVA listens for a second sound which may indicate that someone is responding. This is all done locally, without needing to "understand" the content of the sound or to send it to the cloud. If no response follows, AAVA guesses that the first sound might have been a question addressed to it, so it politely asks the user to confirm that they want it to be a part of the conversation.
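The addressing heuristic described above condenses into a small decision rule. This sketch collapses the time-based listening into boolean inputs for clarity; the actual system evaluates them over time across threads:

```python
def decide_action(person_count, heard_sound, heard_reply):
    """Sketch of AAVA's addressing heuristic as a pure function.
    Returns 'idle', 'wake', or 'confirm'."""
    if not heard_sound:
        return "idle"
    if person_count <= 1:
        return "wake"        # alone in the room: assume the sound was for us
    if heard_reply:
        return "idle"        # a second voice answered: it's a human conversation
    return "confirm"         # no reply followed: politely ask if we're wanted
```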

How I built it

  • Python - multi-threading, audio processing, GUI
  • Houndify - Speech understanding, Voice Assistant responses
  • Google Cloud Vision API - Counting people present in low-resolution privacy-enhanced images.
  • Google Cloud Text-to-Speech API - Pleasant TTS to give the assistant a voice.
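The multi-threaded structure mentioned above can be sketched as a producer/consumer pair connected by a queue: one thread captures audio while another classifies it. This is a toy skeleton with canned frames and a made-up loudness check, not the project's actual code:

```python
import queue
import threading

def capture_audio(frames, out_q):
    """Producer: pushes audio frames onto a queue. In the real app this
    thread would read from the microphone; here frames are canned."""
    for frame in frames:
        out_q.put(frame)
    out_q.put(None)  # sentinel: capture finished

def monitor(out_q, results):
    """Consumer: classifies each frame's loudness. Stand-in for the
    ambient-level trigger running on its own thread."""
    while True:
        frame = out_q.get()
        if frame is None:
            break
        results.append("loud" if max(frame) > 0.3 else "quiet")

q = queue.Queue()
results = []
frames = [[0.01] * 4, [0.9] * 4]
threads = [threading.Thread(target=capture_audio, args=(frames, q)),
           threading.Thread(target=monitor, args=(q, results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results now holds one label per captured frame
```

Decoupling capture from analysis this way keeps the microphone loop responsive even when a classification step is slow.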

(More coming soon.)

Challenges I ran into

  • Some difficulties with noise handling.
  • Internet and computer lag.

Accomplishments that I'm proud of

  • Putting together a working proof-of-concept by integrating many emerging technologies.

What I learned

Google Cloud Vision API is incredible! It can recognize objects even when heavily blurred!

What's next for AAVA

  • Software improvements: deeper understanding of audio and visual stimuli; Q-learning to inform wake decisions
  • Hardware implementation, with privacy designed into the device (blurred camera lens, electronic microphone switch-off)

Built With

  • google-cloud-tts-api
  • google-cloud-vision-api
  • houndify
  • opencv
  • pysimplegui
  • python