Inspiration
With the rise of generative AI, voice deepfakes have become nearly indistinguishable from real speech, leading to real-world harm. A close friend's mother was tricked into wiring money to someone impersonating her son. Incidents like this underscore the urgent need for trustworthy voice authentication during sensitive calls. Recent voice models such as ElevenLabs v3 now reproduce natural speaking patterns closely enough to cross the uncanny valley, yet few algorithms exist to identify these deepfaked voices.
What it does
Vocera (vo-cher-uh) is a mobile app that verifies a caller's identity using their unique vocal signature—designed specifically for high-risk phone calls, such as when a friend or family member asks for money, gift cards, or urgent help like bail.
To confirm their identity, the caller must speak a secret passphrase known as a Vox Key: a private phrase that is grammatically nonsensical and delivered in a contrasting tone, such as "The grieving dogs are raining birds," spoken enthusiastically. The unusual structure, and the distinct way each person says it, makes the phrase difficult for deepfake models to mimic. Vocera then authenticates the caller through a three-step process:
- Passphrase check – Is the correct Vox Key textual phrase being spoken?
- Speaker verification – Does the voice resemble the registered user's voice?
- Deepfake detection – Are there signs the voice was AI-generated?
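The three checks above only succeed together. A minimal sketch of how the stages might be combined (the threshold values and class names here are illustrative placeholders, not Vocera's actual internals):

```python
from dataclasses import dataclass

# Illustrative cutoffs -- not the app's tuned values.
SPEAKER_THRESHOLD = 0.75   # minimum similarity to the registered voice profile
DEEPFAKE_THRESHOLD = 0.5   # probability above which audio is flagged as synthetic


@dataclass
class VerificationResult:
    passphrase_ok: bool    # did Whisper's transcript match the Vox Key?
    speaker_score: float   # similarity to the registered voice profile (0..1)
    deepfake_prob: float   # estimated probability the clip is AI-generated (0..1)

    @property
    def verified(self) -> bool:
        # All three stages must pass for the caller to be authenticated.
        return (
            self.passphrase_ok
            and self.speaker_score >= SPEAKER_THRESHOLD
            and self.deepfake_prob < DEEPFAKE_THRESHOLD
        )


result = VerificationResult(passphrase_ok=True, speaker_score=0.82, deepfake_prob=0.12)
```

Gating on all three stages means a cloned voice that nails the passphrase still fails if it trips the deepfake detector, and vice versa.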
For maximum security, each Vox Key is single-use. Speaking it captures your vocal signature: tone, inflection, and other features unique to you. After each use, the Vox Key must be replaced by recording ten ~5-second voice clips, which capture low-level vocal traits such as shimmer, inflection, and pause distribution.
If the verification passes, you know you’re speaking with the real person—not a cloned voice.
How we built it
Vocera is built with React Native and Expo for cross-platform deployment, styled with NativeWind, and powered by a Zustand store for state management. We use Supabase for user auth and data storage (Postgres databases + buckets), and Heroku for hosting our models.
Each user’s Vox Key is generated by Claude and recorded 10 times to capture unique vocal traits. We extract embeddings using OpenSMILE and store them as part of each user’s voice profile.
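One simple way to turn the ten recordings into a single profile (an assumption for illustration, not necessarily Vocera's exact method) is to average the per-clip OpenSMILE feature vectors and L2-normalize the result, so new clips can be compared by cosine similarity:

```python
import numpy as np


def build_voice_profile(clip_embeddings: list[np.ndarray]) -> np.ndarray:
    """Average per-clip OpenSMILE feature vectors into one profile vector."""
    stacked = np.stack(clip_embeddings)       # shape: (n_clips, n_features)
    profile = stacked.mean(axis=0)            # element-wise mean across clips
    return profile / np.linalg.norm(profile)  # L2-normalize for cosine comparison


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Ten mock 4-dimensional "embeddings" stand in for real OpenSMILE functionals.
rng = np.random.default_rng(0)
clips = [np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(0, 0.05, 4) for _ in range(10)]
profile = build_voice_profile(clips)
```

Averaging over ten clips smooths out per-recording noise, so the profile reflects stable vocal traits rather than any one take.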
Our three-stage verification pipeline includes:
- OpenAI Whisper for passphrase transcription
- SpeechBrain for speaker verification
- OpenSMILE + sigmoid scoring to detect nuanced deepfake artifacts
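The sigmoid-scoring stage can be pictured as a weighted sum of acoustic features squashed into a probability. The feature names and weights below are illustrative placeholders, not the values in Vocera's trained model:

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def deepfake_probability(
    features: dict[str, float], weights: dict[str, float], bias: float
) -> float:
    """Linear combination of acoustic features, squashed into a probability.

    Hypothetical feature names and weights -- for illustration only.
    """
    raw = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(raw)


feats = {"shimmer": 0.02, "jitter": 0.01, "pause_variance": 0.3}
w = {"shimmer": -4.0, "jitter": -6.0, "pause_variance": 2.5}
p = deepfake_probability(feats, w, bias=-0.5)
```

A probability output (rather than a hard yes/no) lets the final decision threshold be tuned against real and synthetic audio.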
To improve robustness, we generated adversarial examples using ElevenLabs voice cloning tools, allowing us to fine-tune the system with both authentic and deepfake data.
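With scored genuine and cloned clips in hand, one way such fine-tuning could work (a sketch under assumed toy data, not our recorded results) is a grid search for the decision threshold that best separates the two sets:

```python
def best_threshold(genuine_scores: list[float], fake_scores: list[float]) -> float:
    """Grid-search the decision threshold that best separates genuine from cloned audio.

    Scores are deepfake probabilities; clips scoring above the threshold are rejected.
    """
    candidates = sorted(genuine_scores + fake_scores)
    best_t, best_acc = 0.5, 0.0
    total = len(genuine_scores) + len(fake_scores)
    for t in candidates:
        # Genuine clips should score at or below t; fakes should score above it.
        correct = sum(s <= t for s in genuine_scores) + sum(s > t for s in fake_scores)
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Toy scores: genuine clips score low, voice-cloned clips score high.
genuine = [0.05, 0.12, 0.20, 0.31]
fake = [0.55, 0.71, 0.88, 0.93]
threshold = best_threshold(genuine, fake)  # -> 0.31 on this toy data
```

Generating the adversarial set ourselves meant the threshold was calibrated against the same class of clones an attacker would actually use.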
Challenges we ran into
There were no pretrained or unified solutions capable of countering cutting-edge generative voice models, so we had to engineer our own from scratch. This included designing evaluation procedures, creating datasets, and testing multiple models to build a security pipeline that even state-of-the-art deepfakes—like ElevenLabs v3—couldn’t fool.
While we had prior experience with React, we were new to React Native and mobile development. Finding and learning the ins and outs of suitable audio recording libraries was challenging. Animation tuning and mobile-specific bugs consumed much of our limited 24-hour window.
We also lost time when FFmpeg, our media-conversion tool, wrote audio files in formats incompatible with our models, stalling integration and testing throughout the hackathon.
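A common fix for this class of problem (sketched here as an assumption, not our exact command) is to pin the output to plain 16-bit PCM mono WAV at the sample rate the models expect, using standard FFmpeg flags:

```python
def ffmpeg_wav_args(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that re-encodes audio as 16-bit PCM mono WAV.

    Pinning codec, channel count, and sample rate avoids handing the models a
    container/codec combination they cannot read. Run with subprocess.run(args).
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite output without prompting
        "-i", src,                # input file (e.g. an .m4a recorded on-device)
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the rate the models expect
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        dst,
    ]


args = ffmpeg_wav_args("call.m4a", "call.wav")
```

Being explicit about every encoding parameter, rather than relying on FFmpeg's defaults, removes an entire class of silent format mismatches.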
Accomplishments that we're proud of
We’re proud of building a fully functional mobile app in just 24 hours, despite being new to React Native. We developed a deepfake-resistant voice authentication pipeline that performed better than human judgment in many cases. We successfully integrated OpenAI Whisper, Claude, SpeechBrain, and OpenSMILE into a multi-stage verification system. Our UI is clean, intuitive, and animation-enhanced, offering a user-friendly experience without compromising security. Additionally, we generated synthetic deepfake audio using ElevenLabs to strengthen our model against real-world threats.
What we learned
We learned how to extract and compare voice embeddings using OpenSMILE to detect deepfakes, and how to integrate ML tools like Whisper and SpeechBrain into a real-time mobile app. We gained hands-on experience with React Native, mobile-specific debugging, and building secure, user-friendly interfaces under time pressure.
What's next for Vocera
We aim to optimize real-time performance, expand to on-device processing for privacy, and position Vocera as a new standard for voice authentication in banking, identity verification, and sensitive communications.
Built With
- auth
- expo.io
- flask
- heroku
- neural-networks
- openai
- python
- react-native
- supabase

