Inspiration

In the modern age, short-form content has become the most popular form of media consumption, as seen through TikTok, Instagram Reels, and YouTube Shorts. This new style of media has also popularized a new style of music, characterized by the prioritization of a viral-sounding snippet somewhere in the song. Whether because of lyrics, instrumentals, or some other musical element, songs are increasingly known for a single snippet rather than the full piece, since much of the public discovers new music through short-form content built around these snippets. One version of this new style has become extremely popular: the mashup, which combines two songs by mixing the musical elements of both. Likely due to their uniqueness, mashups took off on short-form media apps. They are an inherently creative style of music, since mashups are usually made not by the original creators of the songs but by third parties, and they sparked a trend of both popular and lesser-known DJs creating mashups of all sorts of songs. Our thought as a team was: how could we make these mashups despite lacking the musical knowledge to make them the traditional way? Thus came the idea of MashLab, a website where users can upload their own MP3s of songs and turn them into a mashup. We also took into consideration that when using computers to generate art, it is important to incorporate the subjectivity, creativity, and individualism that characterize non-computer-generated art.

What it does

Our website prompts users to upload two MP3s (and their respective song names) that they would like to turn into a mashup. It then creates a mashup of the two songs that the user can listen to, and asks the user whether they "like" or "dislike" the result. In the backend, a machine learning model learns from this feedback to generate a "likability score" unique to each user, predicting whether they would enjoy a given mashup. The idea behind this was that when using computers to generate art, it is important to incorporate the subjectivity, creativity, and individualism that characterize non-computer-generated art. Because art is an inherent form of human expression shaped by emotional experience, something computers cannot replicate, we found the best strategy was a machine learning model trained on the user's uniquely human opinions.

How we built it

On the frontend we used React, JSX, and Vite to create an interactive environment where users upload two MP3s. The frontend sends the files to our backend, where an algorithm determines the feasibility of a mashup, the best location in each song to start mashing, and how to mash the songs together based on their unique musical components.

The backend begins by transforming the raw song into a feature-vector representation. It converts the audio to a mono waveform, computes spectral and rhythmic features, and returns {bpm, key, mode, energy, spectral_centroid}. Beat tracking starts from the spectral flux, flux(n) = Σ_k max(0, |X(k, n)| − |X(k, n−1)|), which measures the positive change in spectral magnitude between consecutive frames. The beat interval is estimated by autocorrelating the flux curve, and tempo (in BPM) is 60 / beat interval. The beat timestamps then become a rhythmic grid used to align the tracks.
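As a minimal sketch of the flux-plus-autocorrelation pipeline above (the function name, frame/hop sizes, and the 40–200 BPM search range are our illustrative assumptions, not the exact production values):

```python
import numpy as np

def estimate_tempo(x, sr=22050, frame=1024, hop=512):
    # Short-time magnitude spectra |X(k, n)| over a sliding window
    n_frames = 1 + (len(x) - frame) // hop
    window = np.hanning(frame)
    mags = np.array([np.abs(np.fft.rfft(window * x[i * hop:i * hop + frame]))
                     for i in range(n_frames)])
    # Spectral flux: flux(n) = sum_k max(0, |X(k, n)| - |X(k, n-1)|)
    flux = np.maximum(0.0, np.diff(mags, axis=0)).sum(axis=1)
    flux -= flux.mean()
    # Autocorrelate the flux curve; the strongest lag within a plausible
    # 40-200 BPM range gives the beat interval in frames
    ac = np.correlate(flux, flux, mode="full")[len(flux) - 1:]
    fps = sr / hop
    lo, hi = int(fps * 60 / 200), int(fps * 60 / 40)
    lag = lo + int(np.argmax(ac[lo:hi]))
    beat_interval = lag / fps    # seconds per beat
    return 60.0 / beat_interval  # tempo in BPM
```

In practice librosa's beat tracker does this work for us; the sketch just makes the math above concrete.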

At a high level, the system transforms each song from raw audio into numerical data (a large array of amplitude values sampled over time). Signal processing tools extract musical features, which become compact numerical vectors summarizing the song's musical identity. The code then compares the feature vectors of two songs to determine whether they make a good mashup. Once compatible segments are selected, Demucs, the source-separation model we use, splits each clip into stems (vocals, instrumentals). Three factors determine the initial compatibility score between two songs.

Key compatibility is a value from 0.0 to 1.0 that describes how similar the keys of the two tracks are. Key detection works by computing a 12-dimensional chroma vector representing the energy in each pitch class, then comparing that vector against 24 key templates (12 major, 12 minor). Similarity is measured using cosine similarity, score = (c · template) / (||c|| ||template||), and the template with the highest score determines the key and mode.
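A minimal sketch of that template matching (the Krumhansl-style profile weights and the helper name `detect_key` are our assumptions for illustration):

```python
import numpy as np

# Krumhansl-style pitch-class profiles for C major / C minor;
# rolling them gives templates for all 24 keys
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def detect_key(chroma):
    """Compare a 12-dim chroma vector against 24 key templates
    via cosine similarity; the best match gives key and mode."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    score, note, mode = max(
        ((cos(chroma, np.roll(profile, k)), NOTES[k], name)
         for profile, name in ((MAJOR, "major"), (MINOR, "minor"))
         for k in range(12)),
        key=lambda t: t[0])
    return note, mode, score
```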

Tempo compatibility ensures the songs can be stretched to match each other's beats without audible distortion. It checks whether the tempos fit a pre-determined ratio such as 1:1, 1:2, or 0.5:1. If one matches, it calculates the stretch percentage as ((BPM_A / (ratio × BPM_B)) − 1) × 100.
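A sketch of that check (the function name and the ~8% "too much distortion" cutoff are our assumptions; the formula is the one above):

```python
def stretch_percent(bpm_a, bpm_b, ratios=(1.0, 2.0, 0.5)):
    """Try each pre-determined tempo ratio and return the smallest
    stretch (%) needed to align beats, or None if every ratio would
    require too much time-stretching."""
    best = None
    for r in ratios:
        pct = (bpm_a / (r * bpm_b) - 1.0) * 100.0
        if best is None or abs(pct) < abs(best):
            best = pct
    return best if abs(best) <= 8.0 else None  # threshold is our assumption
```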

Frequency compatibility measures how distinct the songs are in pitch, avoiding masking, where one song's frequencies clash with the other's. It is defined as log2(C_A / C_B), where C_A and C_B are the average pitches (spectral centroids) of the two tracks; this metric tells how many octaves apart the tracks sit.

The final Compatibility Score = 40 + 60 × (0.45 × Key + 0.35 × Tempo + 0.20 × Frequency).
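Putting the three components together might look like the sketch below. The key and tempo components arrive as 0.0–1.0 scores; how the octave distance maps to a 0–1 frequency score is our assumption (here, clamped so that about one octave of separation counts as fully distinct):

```python
import numpy as np

def compatibility_score(key_sim, tempo_sim, centroid_a, centroid_b):
    """Final score = 40 + 60 * (0.45*Key + 0.35*Tempo + 0.20*Frequency)."""
    octaves = abs(np.log2(centroid_a / centroid_b))  # spectral separation
    freq_sim = min(octaves, 1.0)                     # clamping is our assumption
    return 40 + 60 * (0.45 * key_sim + 0.35 * tempo_sim + 0.20 * freq_sim)
```

The 40-point floor means even a poor pairing scores 40, so the 0–100 range reads naturally as a percentage.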

Our Segment Finder Engine finds the most musically interesting section of each track for mashing. It locates the most impactful portions of the song (beat drops, verse transitions, and tempo changes) through the score s = 0.45 × Segment Decibels + 0.30 × Sound Fluctuation + 0.25 × Rhythm Speed.
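A toy version of that scoring over fixed windows might look like this; the concrete feature proxies (RMS level, sample-to-sample fluctuation, zero-crossing rate for rhythm) and the per-feature normalization are our assumptions, but the 0.45/0.30/0.25 weighting is the formula above:

```python
import numpy as np

def best_segment(x, sr, seg_seconds=20.0):
    """Score overlapping windows and return the start time (s)
    of the highest-scoring one."""
    seg = int(seg_seconds * sr)
    starts = range(0, max(len(x) - seg, 1), seg // 2)  # 50% overlap
    feats = []
    for s0 in starts:
        w = x[s0:s0 + seg]
        loud = np.sqrt(np.mean(w ** 2))                # segment level (RMS)
        fluct = np.mean(np.abs(np.diff(w)))            # sound fluctuation
        rhythm = np.mean(np.abs(np.diff(np.sign(w))))  # zero-crossing rate
        feats.append((loud, fluct, rhythm))
    f = np.array(feats)
    f = f / (f.max(axis=0) + 1e-12)                    # normalize each feature
    scores = 0.45 * f[:, 0] + 0.30 * f[:, 1] + 0.25 * f[:, 2]
    return list(starts)[int(np.argmax(scores))] / sr
```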

Our mashing algorithm has been fine-tuned to make sure songs sound like a single, polished track. First, it finds the alignment by cross-correlating the vocal's rhythm with the instrumental's beat, finding which ~20-second segment has the most overlap. The beats of the two are aligned, and the waveforms of the instrumental and vocals are summed together. Post-processing then checks that the entire segment keeps a high correlation; otherwise it is pruned from the mashup.
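The core alignment step can be sketched as below (envelope extraction is omitted, and `best_offset` is our illustrative name): given rhythm envelopes for the vocal and the instrumental, cross-correlation finds the lag with the most overlap.

```python
import numpy as np

def best_offset(vocal_env, inst_env):
    """Return the lag (in frames) at which the vocal envelope best
    lines up with the instrumental envelope: a positive result means
    the vocal should be delayed by that many frames."""
    v = vocal_env - vocal_env.mean()
    i = inst_env - inst_env.mean()
    xc = np.correlate(i, v, mode="full")      # correlation at every lag
    return int(np.argmax(xc)) - (len(v) - 1)  # lag of the peak
```

Once the offset is known, the vocal stem is shifted by that lag before its waveform is summed with the instrumental.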

We used a plethora of audio-processing and audio-ML libraries. librosa serves as a high-level tool for analyzing music, doing most of the mathematical audio analysis (beat, tempo, spectral, chroma, onset). NumPy provides fast numerical computing; all waveforms are stored as NumPy arrays. SoundFile handles reading and writing audio files, loading them into NumPy arrays so they can be processed mathematically. pydub is an audio manipulation library that specializes in practical operations like trimming segments, exporting MP3s, and applying fades. Demucs is a deep-learning model for music source separation; it splits songs into stems (vocals, drums, bass, other instruments) and is the primary engine behind extracting the vocals and instrumentals that we then mash.

Challenges we ran into

The primary issue we faced was modelling the features that make a mashup 'good'. Art is inherently subjective, and the limited literature on machine learning for music generation made it difficult to pinpoint realistic methodologies for blending two different tracks together. While we initially tried both simple rulesets and complex statistical techniques, the lack of training data and small sample size made it difficult to find a model of good music combinations that did not overfit. We eventually decided on a hybrid approach: we used orthogonal music components like key, tempo, and frequency to capture basic structural properties of harmonious music, and a personalized online logistic regression model to capture the user's personal preferences. Furthermore, we drew inspiration from the paper 'Modeling the Compatibility of Stem Tracks to Generate Music Mashups', which helped us understand how to model and feature-engineer signals from MP3 data. We also consulted musicians to understand the basic fundamentals of what makes a mashup sound cohesive.
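As a minimal sketch of the personalized online model mentioned above (the class name, feature layout, and learning rate are our assumptions): each like/dislike triggers one SGD step on the log-loss, so the model adapts to the user as they rate mashups.

```python
import numpy as np

class OnlineLikability:
    """Per-user online logistic regression over mashup features
    (e.g. the key/tempo/frequency compatibility components)."""

    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        # Likability score in [0, 1]
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, x, liked):
        # One SGD step on the log-loss for a like (1) / dislike (0)
        err = self.predict(x) - float(liked)
        self.w -= self.lr * err * x
        self.b -= self.lr * err
```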

Accomplishments that we're proud of

Basing our project on a machine learning model, especially on a niche topic like music, forced us to constantly reevaluate our infrastructure. We learned how to anticipate errors, verify the accuracy of our model, and consider the limitations of our current setup as we tested new mashups.

What we learned

While we all started with basic knowledge of music theory, this project also taught us the elements of music design and taste. In order to model track compatibility, we had to understand differences in key, genre, tempo, and structure, and how they interact between tracks. Working through these complexities gave us a better understanding of the fundamentals of music and of what shapes a listener's preference for a successful mashup.

What's next for MashLab

Our project was made with our users in mind: people who do not necessarily have an in-depth understanding of music theory, but who want to create unique art that they and others can enjoy. While our current infrastructure lets creators personalize their experience through the rating system, the next iteration of MashLab could allow users to edit songs directly. Using our model to find an optimal location to mash up songs, users could adjust the pitch, beat, and vocals to better match their preferences.

To facilitate sharing, we could also make our platform fully online. Users would be able to publish the mashups they generate, while other users could listen, rate, and discover new combinations of music they enjoy. Ultimately, the goal of MashLab is to bridge the gap between those with music literacy and those without, making music creation accessible to everyone with an idea.
