Music producers can have a big influence on the sound of an album. This model can be used for two main purposes:
- Music Discovery: Services like Pandora and Spotify compete on their ability to find music users will like. Production "sound" is another dimension users may enjoy exploring when searching for new music.
- Music Publishing: Creating and maintaining a database of music production and ownership credits has historically been a difficult task. When streaming services pay royalties to record labels, creative collaborators often go unpaid because of missing documentation. This model is a step toward "fingerprinting" the creators of a song.
- Spotify API - Contains audio files and song metadata.
- Wikipedia - Record producer labeling.
The key to identifying a record producer lies in the timbre of a sound. Timbre can be thought of as the "quality" or "identity" of a sound: it is what lets us tell a flute from a trumpet even when they play the same notes. Much of a sound's timbre lives in its higher-frequency overtones.
Thirty-second MP3 clips from 1,000 songs (10 producers, 100 songs each) were converted to WAV files and run through a highpass filter to accentuate the timbre-carrying frequencies. For each clip, the Mel-Frequency Cepstral Coefficients (MFCCs) were calculated.
MFCCs, very generally, are a set of values that correspond to the timbre of a sound.
More technically, MFCCs are calculated by first taking the Fast Fourier Transform (FFT) of short, overlapping windows of the waveform to convert from amplitude-time space to frequency-time space. The power spectrum of each window is then mapped onto the mel scale with a bank of triangular filters, the log of the filter energies is taken, and the result is decomposed with the Discrete Cosine Transform (DCT). The resulting values are the Mel-Frequency Cepstral Coefficients. The figure below shows an example of the audio processing.
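The filter-then-MFCC pipeline can be sketched with NumPy and SciPy alone. The highpass cutoff, window size, and number of mel filters below are illustrative assumptions, not values from the project:

```python
import numpy as np
from scipy.signal import butter, filtfilt, stft
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):           # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):           # falling edge of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr, n_mfcc=20, n_fft=2048, cutoff_hz=500.0):
    # Highpass filter to accentuate the timbre-carrying overtones
    # (cutoff_hz is an assumed value for illustration).
    b, a = butter(4, cutoff_hz / (sr / 2), btype="high")
    filtered = filtfilt(b, a, signal)
    # Windowed FFT -> power spectrogram (frequency-time space).
    _, _, Z = stft(filtered, fs=sr, nperseg=n_fft)
    power = np.abs(Z) ** 2
    # Mel filterbank, log, then DCT of each frame.
    mel_energies = mel_filterbank(40, n_fft, sr) @ power
    log_mel = np.log(mel_energies + 1e-10)
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]

# Example: one second of a synthetic tone with an overtone.
sr = 22050
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1320 * t)
coeffs = mfcc(sig, sr)
print(coeffs.shape)  # (n_mfcc, number of time frames)
```

Applied to a 30-second clip, the same function yields the roughly 20 x 1,200 coefficient matrix described below.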
After processing, each song has about 24,000 MFCC values (20 coefficients in the frequency dimension by roughly 1,200 frames in the time dimension). Principal Component Analysis (PCA) was used to reduce the dimensionality to 12 components, which can be thought of as "sonic eigenvectors."
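A minimal sketch of the reduction step with scikit-learn, using random stand-in data in place of the real flattened MFCC matrices (100 songs here instead of 1,000, to keep the example quick):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the real data: each row is one song's flattened
# 20 x 1200 MFCC matrix (24,000 features per song).
X = rng.normal(size=(100, 20 * 1200))

# Project each song onto the 12 leading principal components.
pca = PCA(n_components=12)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (100, 12)
```

Each song is now a single point in a 12-dimensional space, which is what the classifier below operates on.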
A K-Nearest Neighbors (KNN) algorithm was used to identify the most likely producers for any new song. The figure below shows an example of how the KNN algorithm works.
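The classification step might look like the following scikit-learn sketch. The training data are random stand-ins, and `n_neighbors=5` is an assumed parameter, not one reported by the project:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Stand-in training set: 700 songs, each a 12-D PCA vector,
# labeled 0-9 for the ten producers.
X_train = rng.normal(size=(700, 12))
y_train = rng.integers(0, 10, size=700)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# For a new song, rank producers by the share of its
# nearest neighbors belonging to each class.
new_song = rng.normal(size=(1, 12))
probs = knn.predict_proba(new_song)[0]
ranked = np.argsort(probs)[::-1]
print(ranked[:3])  # the three most likely producer labels
```

Because KNN's `predict_proba` reflects neighbor counts, it naturally produces the "most likely producers" ranking rather than a single hard label.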
The model was tested on a 300-song test set. The multiclass accuracy across 10 balanced producer classes was 44%, compared to a random-guess baseline of 10%.
The reduced data were then projected onto a 2D t-SNE plot to show the relative clustering of songs.
An interactive t-SNE plot can be found here.
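A sketch of the projection with scikit-learn's t-SNE, again on random stand-in data; the perplexity value is an assumed default, not a project setting:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for the 300 test songs in 12-D PCA space.
X = rng.normal(size=(300, 12))

# Embed into 2-D while preserving local neighborhood structure,
# so songs with similar production "sound" land near each other.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (300, 2)
```

Unlike PCA, t-SNE is used here purely for visualization: distances in the 2-D plot are meaningful locally but not globally.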
- Deconvolution of Variables:
  - Artist/Album/Instrumentation
  - More accurate labeling
- Scale:
  - More songs/producers
  - Parallelize and deploy on AWS/Spark
- Feature Engineering:
  - More audio processing/reverse engineering
  - Remove music structure by breaking songs into beats
- Modeling:
  - Neural networks with TensorFlow/Keras



