Inspiration
There has been an explosion of content about voice synthesizers in the music industry, along with growing security concerns about deepfakes in video and audio. Seeing this, I asked myself: if a cloning model can understand what a person's voice sounds like well enough to construct a clone of it, then surely we can reverse-engineer that process to identify the speaker.
What it does
Takes in audio input and predicts the speaker's identity (~85% accuracy)
How we built it
We built a classifier that converts audio into a mel spectrogram image, then trains an image classifier on those spectrogram images, each labeled with the identity of the person speaking in the audio.
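The pipeline above can be sketched in two stages: turn raw audio into a mel spectrogram "image", then feed those images to any standard image classifier. Below is a minimal, pure-NumPy sketch of the first stage; in practice a library such as librosa or torchaudio would be used, and all parameter values here (sample rate, FFT size, number of mel bands) are illustrative assumptions rather than the project's actual settings.

```python
# Sketch: audio -> log-mel spectrogram matrix, usable as an image for a classifier.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(audio, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Short-time FFT power, projected onto mel bands, then log-compressed.
    window = np.hanning(n_fft)
    frames = [audio[s:s + n_fft] * window
              for s in range(0, len(audio) - n_fft, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10).T  # shape (n_mels, n_frames): the "image"

# Example: one second of a 440 Hz tone stands in for a voice clip.
sr = 16000
t = np.arange(sr) / sr
img = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(img.shape)
```

The resulting 2-D matrix can be saved as a grayscale image or fed directly into a CNN, with one class per speaker.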
Challenges we ran into
There were plenty of issues in building a model and dataset from scratch, particularly with overfitting and accuracy. We also ran into a lot of infrastructure issues in submitting audio over the web to the API.
Accomplishments that we're proud of
Building an entire model from scratch with 85%+ accuracy, with no prior experience in any of this
What we learned
Model engineering practices: it is so important to build in an iterative manner instead of just throwing all of your data at the model at once.
What's next for VoID
- Speech embeddings: there are so many other applications for voice identity, such as...
- Security
- Dev Tools
- Accessibility
- Audio Discovery
- Language