Our party consists of Toby Cox, Ryon Peddapalli, and Jacob Davis!
Inspired by the Sam & Cat Magic ATM skit, we cast offkey, a tool for converting audio into reproducible passwords and hashes, for authentication and security sturdier than the knock spell! Popular tunes and songs can be easier to remember than strings of characters, so offkey helps prevent that pesky annoyance of forgetting your password. Offkey is valuable for authenticating voices over phone calls for important transactions, for example, and it helps elderly community members and those with poor memory use their passwords, since music and sound are easy to recall.
Our backend is written in Python using Django, and our frontend is built with JavaScript, HTML, and CSS. We make use of the SciPy, NumPy, and Librosa libraries in Python, and offkey.tech is our domain name (pending registration). We wrote some tests using SpeechBrain, but offkey ultimately does not use it. As for the actual model, the accuracy score combines the differences in values between multiple tests through a trained PyTorch model with around 145 parameters spread across multiple layers; these parameters are essentially weights over different features. We define the difference between two waves through the L2 norm: we take the FFT of each wave to split it into its constituent sine waves, then sum the squares of the differences between their coefficients and take the square root. One of the more important features in our model was the MFCC, which maps the linear differences of the L2 norm onto a roughly logarithmic scale, emphasizing the lower frequencies where human speech tends to cluster. Our model combined many more features along these lines.
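The FFT-based L2 distance described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (the function name and zero-padding choice are ours, not offkey's exact implementation):

```python
import numpy as np

def spectral_l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between two waveforms via the L2 norm in the frequency domain.

    The FFT splits each wave into its constituent sine waves; we then sum
    the squared differences of the coefficient magnitudes and take the
    square root. Comparing magnitudes discards phase, so small timing
    shifts between otherwise-similar clips matter less.
    """
    n = max(len(a), len(b))            # zero-pad both clips to a common length
    fa = np.abs(np.fft.rfft(a, n=n))   # magnitude spectrum of each wave
    fb = np.abs(np.fft.rfft(b, n=n))
    return float(np.sqrt(np.sum((fa - fb) ** 2)))
```

Identical clips give a distance of zero, while clips dominated by different frequencies give a large one, which is what makes the metric usable for thresholded matching.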
Determining a solid metric and method for reliably classifying the audio took up most of our time. Factors that made this challenging include accounting for the time domain, determining how one voice differs from another, and handling minute differences that could result in an entirely different hash. We explored methods including the Fast Fourier Transform, Mel-Frequency Cepstral Coefficients, and deep learning models with PyTorch. We landed on a traditional method using the Fast Fourier Transform for its relative computational efficiency and scalability, conducting a parameter space search to find an optimal configuration for determining suitable audio clips. The problem with this parameterization is that the human voice tends to be hard to characterize: most factors help differentiate only the speaker, or only what the speaker is saying, but not both. This motivated the implementation described in the "How we built it" section, which combines multiple factors in a single model to create an accurate comparison.
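A parameter space search of the kind described above can be sketched as a grid search over an acceptance threshold. This is a toy illustration with made-up distance values; offkey's actual search sweeps more parameters (FFT configuration, feature weights) than a single threshold:

```python
import numpy as np

def grid_search_threshold(same_dists, diff_dists, thresholds):
    """Pick the acceptance threshold that best separates distances between
    clips of the same speaker from distances between clips of different
    speakers, scored by balanced accuracy over the two classes."""
    best_t, best_acc = None, -1.0
    for t in thresholds:
        true_accept = np.mean(np.asarray(same_dists) < t)    # matches accepted
        true_reject = np.mean(np.asarray(diff_dists) >= t)   # non-matches rejected
        acc = (true_accept + true_reject) / 2
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

With well-separated distance distributions the search finds a threshold with perfect balanced accuracy; with overlapping distributions it surfaces the trade-off between false accepts and false rejects that motivated combining multiple features in one model.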