https://mtg.github.io/melon-playlist-dataset/
Read our ICASSP 2021 paper for more details:
Melon Playlist Dataset: a public dataset for audio-based playlist generation and music tagging. Ferraro A., Kim Y., Lee S., Kim B., Jo N., Lim S., Lim S., Jang J., Kim S., Serra X., & Bogdanov D. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2021), 2021.
One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation. Even though the latter can be addressed by collaborative filtering approaches, audio provides opportunities for research on track suggestions and building systems resistant to the cold-start problem, for which we provide a baseline. Moreover, the playlists and the annotations included in the Melon Playlist Dataset make it suitable for metric learning and representation learning.
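As a quick illustration of what working with the dataset looks like, here is a minimal sketch that loads and inspects a single mel-spectrogram; the file path and array layout are assumptions, so check the dataset documentation for the actual directory structure.

```python
import numpy as np

# Minimal sketch: inspect one mel-spectrogram from the Melon Playlist Dataset.
# The path and array layout below are assumptions -- check the dataset
# documentation for the actual directory structure and shapes.
mel = np.load("arena_mel/0/1.npy")   # hypothetical path to one track
print(mel.shape)                     # expected (n_mel_bands, n_frames)
print(mel.min(), mel.max())          # magnitudes are typically log-scaled
```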
One of the challenges in making the distribution of this data possible was to identify the trade-off settings for the mel-spectrograms' frequency and temporal resolution, which would allow us to significantly reduce the data size with only minor or negligible loss in performance. To this end, we did a preliminary study, analyzing the performance of several music auto-tagging models under different resolutions:
How low can you go? Reducing frequency and time resolution in current CNN architectures for music auto-tagging. Ferraro A., Bogdanov D., Serra X., Jeon J. H., & Yoon J. The 28th European Signal Processing Conference (EUSIPCO 2020), 2020.
Automatic tagging of music is an important research topic in Music Information Retrieval and audio analysis algorithms proposed for this task have achieved improvements with advances in deep learning. In particular, many state-of-the-art systems use Convolutional Neural Networks and operate on mel-spectrogram representations of the audio. In this paper, we compare commonly used mel-spectrogram representations and evaluate model performances that can be achieved by reducing the input size in terms of both lesser amount of frequency bands and larger frame rates. We use the MagnaTagaTune dataset for comprehensive performance comparisons and then compare selected configurations on the larger Million Song Dataset. The results of this study can serve researchers and practitioners in their trade-off decision between accuracy of the models, data storage size and training and inference times.
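To make the trade-off concrete, here is a hedged sketch (using librosa, with illustrative settings rather than the ones used for the released data) that renders the same audio at a full and a reduced mel resolution and compares their sizes:

```python
import librosa

# Sketch of the resolution trade-off studied above: the same track rendered
# as a "full" and a "reduced" mel-spectrogram. The concrete values are
# illustrative, not the settings used for the released dataset.
y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

full = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                      hop_length=256, n_mels=96)
small = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                       hop_length=512, n_mels=48)

print(full.shape, small.shape)
print("size reduction: %.1fx" % (full.size / small.size))
```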
As the Melon Playlist Dataset contains different types of metadata (playlists, tags, and genres), it is suitable for multi-modal learning. We have explored applying contrastive learning to playlist and genre information to generate audio embeddings:
Enriched Music Representations with Multiple Cross-Modal Contrastive Learning. Ferraro A., Favory X., Drossos K., Kim Y., & Bogdanov D. IEEE Signal Processing Letters, vol. 28, 2021.
Modeling various aspects that make a music piece unique is a challenging task, requiring the combination of multiple sources of information. Deep learning is commonly used to obtain representations using various sources of information, such as the audio, interactions between users and songs, or associated genre metadata. Recently, contrastive learning has led to representations that generalize better compared to traditional supervised methods. In this paper, we present a novel approach that combines multiple types of information related to music using cross-modal contrastive learning, allowing us to learn an audio feature from heterogeneous data simultaneously. We align the latent representations obtained from playlist-track interactions, genre metadata, and the tracks’ audio, by maximizing the agreement between these modality representations using a contrastive loss. We evaluate our approach in three tasks, namely, genre classification, playlist continuation and automatic tagging. We compare the performances with a baseline audio-based CNN trained to predict these modalities. We also study the importance of including multiple sources of information when training our embedding model. The results suggest that the proposed method outperforms the baseline in all three downstream tasks and achieves comparable performance to the state-of-the-art.
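As a rough illustration of the idea, the following sketch shows a generic NT-Xent-style contrastive loss that aligns audio embeddings with playlist-based embeddings of the same tracks; it is a simplified stand-in, not the exact loss used in the paper:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, playlist_emb, temperature=0.1):
    """NT-Xent-style loss that pulls each track's audio embedding towards
    its playlist-based embedding and pushes it away from other tracks
    in the batch. A generic sketch, not the exact loss from the paper."""
    a = F.normalize(audio_emb, dim=1)
    p = F.normalize(playlist_emb, dim=1)
    logits = a @ p.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Example: 8 tracks, 128-dimensional embeddings from two modalities.
loss = contrastive_alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```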
https://github.com/minzwon/sota-music-tagging-models

Evaluation of CNN-based automatic music tagging models. Won, M., Ferraro A., Bogdanov D., & Serra X. In 17th Sound and Music Computing Conference (SMC2020), 2020.
Recent advances in deep learning accelerated the development of content-based automatic music tagging systems. Music information retrieval (MIR) researchers proposed various architecture designs, mainly based on convolutional neural networks (CNNs), that achieve state-of-the-art results in this multi-label binary classification task. However, due to the differences in experimental setups followed by researchers, such as using different dataset splits and software versions for evaluation, it is difficult to compare the proposed architectures directly with each other. To facilitate further research, in this paper we conduct a consistent evaluation of different music tagging models on three datasets (MagnaTagATune, Million Song Dataset, and MTG-Jamendo) and provide reference results using common evaluation metrics (ROC-AUC and PR-AUC). Furthermore, all the models are evaluated with perturbed inputs to investigate the generalization capabilities concerning time stretch, pitch shift, dynamic range compression, and addition of white noise. For reproducibility, we provide the PyTorch implementations with the pre-trained models.
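For reference, the two metrics reported in the paper can be computed with scikit-learn as in the following generic sketch (random data stands in for real tag annotations and model activations):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Generic sketch of the metrics used in the evaluation: macro-averaged
# ROC-AUC and PR-AUC over tags for a multi-label tagging model.
# y_true: binary tag matrix, y_score: model activations, shape (n_tracks, n_tags).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 50))
y_score = rng.random((100, 50))

roc_auc = roc_auc_score(y_true, y_score, average="macro")
pr_auc = average_precision_score(y_true, y_score, average="macro")
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC: {pr_auc:.3f}")
```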
We are happy to announce recent updates to the Essentia audio and music analysis library introducing TensorFlow audio models!
Find more about new algorithms and pre-trained models for deep learning inference that you can use in C++ and Python applications in our blog posts.
The algorithms we developed provide a wrapper for TensorFlow in Essentia, designed to offer flexibility of use, easy extensibility, and real-time inference. They allow the use of virtually any TensorFlow model within our audio analysis framework.
The wrapper comes along with a collection of music auto-tagging models and transfer learning classifiers that can be used out of the box.
Some of our models can work in real time, opening many possibilities for audio developers. We provide a demo on how to do that.
For example, here is the MusiCNN model performing music auto-tagging on a live audio stream.
You can use all new functionality and models for deep learning inference to develop your entire audio analysis pipeline in C++ or Python. For quick prototyping, we provide Python wheels on Linux (pip install essentia-tensorflow, pip version ≥19.3).
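For instance, a minimal auto-tagging pipeline in Python looks like the sketch below; the model file name is an example, so download the graph you need from the Essentia models page:

```python
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN

# Minimal sketch of running a pre-trained auto-tagging model with the
# TensorFlow wrapper. The model file name is an example -- download the
# graph you need from the Essentia models page.
audio = MonoLoader(filename="track.mp3", sampleRate=16000)()
activations = TensorflowPredictMusiCNN(graphFilename="msd-musicnn-1.pb")(audio)

print(activations.shape)          # (n_patches, n_tags): tag activations per patch
print(activations.mean(axis=0))   # average activations over the whole track
```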
See our ICASSP 2020 paper for more details.
For all researchers working on music auto-tagging and related tasks, we present a new dataset!
MTG-Jamendo contains over 55,000 full audio tracks labeled by 195 tags: https://mtg.github.io/mtg-jamendo-dataset/
All audio is taken from Jamendo and is publicly available under Creative Commons licenses. The audio files are full songs encoded in high quality as 320 kbps MP3s. All tags were originally provided by the artists and curated by Jamendo for basic quality assurance.
All tags are categorized into three subsets: genre, instrument, and mood/theme tags. In addition, we provide a top-50 tags subset. This makes it possible to work on generic music auto-tagging, similar to other existing tag datasets, as well as on specific subtasks such as genre recognition, instrument detection, or mood/theme recognition.
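As a hedged sketch of how the tag annotations can be read, the following assumes the tab-separated layout of the dataset's TSV files (e.g. autotagging.tsv); the repository also ships its own loading utilities:

```python
import csv
from collections import defaultdict

# Hedged sketch of reading tag annotations from one of the dataset's TSV
# files. The column layout is an assumption here -- the repository provides
# its own loading scripts that should be preferred.
tags_per_track = {}
tag_counts = defaultdict(int)
with open("autotagging.tsv") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)                          # skip header row
    for row in reader:
        track_id, tags = row[0], row[5:]  # tags like "genre---rock"
        tags_per_track[track_id] = tags
        for tag in tags:
            tag_counts[tag] += 1

print(len(tags_per_track), "tracks,", len(tag_counts), "distinct tags")
```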

We provide a VGG-ish baseline solution to all these tasks.
The dataset has been presented at the Machine Learning for Music Discovery Workshop at ICML 2019 (read our paper).
We hope that this dataset will be a valuable addition to auto-tagging research, complementing other datasets in the MIR community.
We have already put this data to use in the MediaEval Emotion and Theme in Music Task challenge this year, which we will continue to organize next year.
We are pleased to announce a new MIR-related task held within the MediaEval 2019 evaluation campaign: Emotion and Theme Recognition in Music Using Jamendo.
The Benchmarking Initiative for Multimedia Evaluation (MediaEval) organizes an annual cycle of scientific evaluation tasks in the area of multimedia access and retrieval. In our task, we invite the participants to try their skills at predicting mood and theme tags associated with music recordings using audio analysis and machine learning algorithms.
The task is framed as an auto-tagging problem with tags specific to moods and themes (e.g., happy, dark, epic, melodic, love, film, space). To build the dataset for this task we used a collection of music from Jamendo that is available under the Creative Commons licenses with tag annotations that come from authors/musicians.
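Since each recording can carry several mood/theme tags, the targets form a binary track-by-tag matrix; a minimal sketch with scikit-learn (using made-up tag lists) looks like this:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Sketch of the multi-label framing: each recording may carry several
# mood/theme tags, so targets become a binary matrix (tracks x tags).
# The tag lists below are made up for illustration.
annotations = [
    ["happy", "melodic"],
    ["dark", "epic", "film"],
    ["love"],
]
mlb = MultiLabelBinarizer()
targets = mlb.fit_transform(annotations)
print(mlb.classes_)   # tag vocabulary
print(targets)        # one row per recording, one column per tag
```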
All interested researchers are warmly welcomed to participate. See the MediaEval Multimedia Benchmark Workshop for the results of the challenge.
One of the challenging problems in music informatics is how to map genre taxonomies across different music databases.
For example, in the context of music information retrieval, classification tasks typically rely on an agreed answer for ground truth. What should we do if we can’t find agreement between our ground-truth sources? What if different sources use the same label, but each source defines it differently?
We’ve been mining metadata for AcousticBrainz for several years, resulting in a new research dataset containing genre annotations from different online sources for up to 2 million tracks, together with extracted music audio features.
https://mtg.github.io/acousticbrainz-genre-dataset/
Read our ISMIR 2019 paper for more details:
The AcousticBrainz Genre Dataset: Multi-Source, Multi-Level, Multi-Label, and Large-Scale. Bogdanov, D., Porter A., Schreiber H., Urbano J., & Oramas S. In 20th International Society for Music Information Retrieval Conference (ISMIR 2019), 2019.
We present the AcousticBrainz Genre Dataset, a large-scale collection of hierarchical multi-label genre annotations from different metadata sources. It allows researchers to explore how the same music pieces are annotated differently by different communities following their own genre taxonomies, and how this could be addressed by genre recognition systems. Genre labels for the dataset are sourced from both expert annotations and crowds, permitting comparisons between strict hierarchies and folksonomies. Music features are available via the AcousticBrainz database. To guide research, we suggest a concrete research task and provide a baseline as well as an evaluation method. This task may serve as an example of the development and validation of automatic annotation algorithms on complementary datasets with different taxonomies and coverage. With this dataset, we hope to contribute to developments in content-based music genre recognition as well as cross-disciplinary studies on genre metadata analysis.
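One kind of analysis the dataset enables is measuring how much two annotation sources agree on the same recordings. The sketch below uses toy dictionaries in place of the per-source ground truth and assumes a hierarchical "genre---subgenre" label encoding:

```python
# Hedged sketch of a cross-source comparison enabled by the dataset:
# how often two annotation sources agree on the top-level genres of the
# same recording. The dictionaries below stand in for annotations loaded
# from the per-source ground-truth files.
source_a = {"rec-1": {"rock", "rock---punk"}, "rec-2": {"jazz"}}
source_b = {"rec-1": {"rock", "electronic"},  "rec-2": {"jazz", "blues"}}

def top_level(labels):
    # Keep only the genre part of hierarchical "genre---subgenre" labels.
    return {label.split("---")[0] for label in labels}

for rec in source_a.keys() & source_b.keys():
    a, b = top_level(source_a[rec]), top_level(source_b[rec])
    jaccard = len(a & b) / len(a | b)
    print(rec, "agreement:", round(jaccard, 2))
```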
Over the past year or so, I’ve been supervising an R&D project run in collaboration with Sonosuite, a digital music distribution service by La Cupula Music, in which we developed software for automated audio quality analysis.
This project will take some of the burden off their quality control team, which needs to ensure there are no unexpected audio problems before the content is pushed to streaming services and shops worldwide.
Meanwhile, we present it at the 146th Audio Engineering Society Convention (AES 2019) in Dublin. Read our paper for more details:
Automatic Detection of Audio Problems for Quality Control in Digital Music Distribution. Alonso-Jiménez, P., Joglar-Ongay L., Serra X., & Bogdanov D. In AES 146th Convention, 2019.
Providing content within industry quality standards is crucial for digital music distribution companies. For this reason, excellent quality control (QC) support is paramount to ensure that the music does not contain audio defects. Manual QC is a very effective and widely used method, but it is very time- and resource-consuming. Therefore, automation is needed in order to develop an efficient and scalable QC service. In this paper we outline the main needs to solve, together with the implementation of digital signal processing algorithms and perceptual heuristics to improve the QC workflow. The algorithms are validated on a large music collection of more than 300,000 tracks.
Furthermore, all these new algorithms are now available in Essentia.
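For example, a frame-wise defect scan with one of these algorithms can be sketched as follows (the frame settings are illustrative defaults; SaturationDetector, GapsDetector, HumDetector, and others follow a similar pattern):

```python
from essentia.standard import MonoLoader, FrameGenerator, ClickDetector

# Sketch of frame-wise defect detection with one of the QC algorithms
# now available in Essentia. The frame settings are illustrative defaults.
frame_size, hop_size = 512, 256
audio = MonoLoader(filename="track.wav", sampleRate=44100)()

detector = ClickDetector(frameSize=frame_size, hopSize=hop_size)
starts, ends = [], []
for frame in FrameGenerator(audio, frameSize=frame_size,
                            hopSize=hop_size, startFromZero=True):
    frame_starts, frame_ends = detector(frame)
    starts.extend(frame_starts)   # click start positions reported by the detector
    ends.extend(frame_ends)

print(f"detected {len(starts)} click(s)")
```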
Over the past months, we’ve been preparing a genre recognition task based on the vast amounts of music data we gathered in the AcousticBrainz database. It is now a part of MediaEval 2017, a benchmarking initiative that organizes an annual cycle of scientific evaluation tasks in the area of multimedia access and retrieval.
The task is about music genre recognition: we want to build systems that are able to predict genre and subgenre of unknown music recordings (songs) given automatically computed music audio features of those recordings.
It is a popular problem in Music Information Retrieval; however, the task we propose is somewhat different, more detailed, and more challenging:
There are different genre taxonomies, and people may not always agree on the meaning of genres. Genre labels are, to a large extent, subjective categories. We want to explore how the same music can be annotated differently by different communities following different genre taxonomies, and how this should be addressed by genre recognition systems. We provide four genre sources that come from different music databases. Their taxonomies vary in specificity, breadth, and meaning of genre labels. These sources include explicit annotations done by music experts and annotations inferred from folksonomies.
Typically, research is done on a small number of broad genre categories. In contrast, we propose to consider more specific genres and subgenres; our data contains hundreds of them.
Genre recognition is often treated as a single-category classification problem, which is not necessarily the way it should be. Our genre data is intrinsically multi-label, and so we propose to treat genre recognition as a multi-label classification problem.
Typically, research is done on small music collections. Instead, we provide a very large dataset of two million recordings annotated with genres and subgenres. The downside is that we are not able to provide audio, only precomputed music features.
Finally, we provide information about the hierarchy of genres and subgenres within each genre annotation source. Systems can take advantage of this knowledge.
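For instance, one simple way a system can exploit the hierarchy is to expand every predicted subgenre with its parent genre, as in the hedged sketch below (the "genre---subgenre" label encoding is an assumption about the ground-truth format):

```python
# Sketch of one way systems can use the provided genre hierarchy: whenever
# a subgenre is predicted, also emit its parent genre. The label format
# ("genre---subgenre") is an assumption about the ground-truth encoding.
predictions = {"rec-1": {"rock---punk", "electronic---house"},
               "rec-2": {"jazz"}}

def expand_with_parents(labels):
    expanded = set(labels)
    for label in labels:
        if "---" in label:
            expanded.add(label.split("---")[0])   # add the parent genre
    return expanded

for rec, labels in predictions.items():
    print(rec, sorted(expand_with_parents(labels)))
```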
I am co-mentoring one of the innovation challenges for this year’s edition of the Sónar+D festival. The idea for this challenge is to create a musical instrument (hardware or software) that helps artists discover, interact with, and play Creative Commons audio content (samples or even entire music tracks) using our audio analysis technologies.
We are searching for music technologists, creative developers, and UI/UX designers. More details here.