ConvMusic

Who: By Timothy Fong, Edward Wibowo, Michael Rehmet, Bokai Bi

GitHub Repo

Milestone 3

Final Writeup

Introduction: What problem are you trying to solve and why?

osu! is a rhythm game in which players click notes in sequence, in time with a song's rhythm. Creating "beatmaps" (which map circles to the rhythm of a song) is a time-consuming task, so we want to build a model that streamlines this process. We aim to make a model that can convert any song file (such as an mp3) into a playable osu! map.

We frame this as a structured prediction problem: given a sequence of audio data representing a song, we want to predict a viable osu! map that matches the song's rhythm. The output beatmap can be thought of as a sequence of note positions and timings (along with other metadata). We plan to employ audio feature extraction techniques in both the preprocessing and training stages.
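As a concrete (and simplified) sketch of this output representation, a beatmap without metadata can be modeled as a time-ordered sequence of hit objects. The `HitObject` class and field names here are our own illustration, not part of any existing API; the playfield dimensions (512x384) follow osu!'s internal coordinate system.

```python
from dataclasses import dataclass

@dataclass
class HitObject:
    """One circle in a beatmap: screen position plus hit time."""
    x: int        # horizontal position, 0-512 in osu!'s playfield coordinates
    y: int        # vertical position, 0-384
    time_ms: int  # when the note should be hit, in ms from song start

# A beatmap (ignoring metadata) is then just an ordered sequence of hit objects.
beatmap = [
    HitObject(x=256, y=192, time_ms=1000),
    HitObject(x=320, y=160, time_ms=1500),
]

# Structured prediction must respect ordering constraints like this one:
assert all(a.time_ms <= b.time_ms for a, b in zip(beatmap, beatmap[1:]))
```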

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

Here is a blog post that tackles a similar problem: https://www.nicksypteras.com/blog/aisu.html. It delves into how to view and analyze song data as a spectrogram, revealing the song's frequencies and amplitudes over its duration. It also describes a CNN architecture using multiple convolution layers, max pooling, and batch normalization, and covers domain-specific details such as encoding "sliders" and "hit objects" (elements of an osu! beatmap) as one-hot vectors. Overall, the post gives us a solid starting point for our own model.
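To make the spectrogram idea concrete, here is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform in plain NumPy. The function name and the frame/hop sizes are our own choices for illustration; in practice a library like librosa would likely handle this step.

```python
import numpy as np

def spectrogram(signal, frame_size=1024, hop=256):
    """Magnitude spectrogram via a windowed short-time Fourier transform."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of real-valued audio
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone at a 22,050 Hz sample rate
sr = 22050
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (n_frames, frame_size // 2 + 1) = (83, 513)
```

Each row of `spec` is one time slice, each column one frequency bin; this 2D "image" is what a CNN like the one in the blog post would consume.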

A key takeaway from this blog post is that accuracy is quite hard to achieve, but since a song can have many interpretations as a beatmap, we don't see this as an obstacle that must necessarily be overcome.

Data: What data are you using (if any)?

The official osu! website hosts a large, freely accessible repository of beatmaps along with their associated song files. New beatmaps are uploaded every day and can be filtered by difficulty (for example, 4-6 stars). Our plan is to download large playlists of similarly difficult beatmaps and designate the song files as our input and the osu! beatmap files as the output.

The blog post mentioned earlier downloaded approximately 25 GB of osu! beatmaps, so we suspect we will need at least as much.
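Once downloaded, each beatmap is a plain-text `.osu` file whose hit objects live under a `[HitObjects]` section as comma-separated fields starting with `x,y,time,...`. A minimal parser for extracting training targets might look like the sketch below (the function name is ours; only the first three fields are read, and later fields such as object type are ignored for simplicity).

```python
def parse_hit_objects(osu_text):
    """Extract (x, y, time) triples from the [HitObjects] section of a .osu file."""
    lines = osu_text.splitlines()
    try:
        start = lines.index("[HitObjects]") + 1
    except ValueError:
        return []  # no hit-object section found
    objects = []
    for line in lines[start:]:
        if line.startswith("["):  # the next section begins
            break
        if not line.strip():
            continue
        x, y, time, *_ = line.split(",")
        objects.append((int(x), int(y), int(time)))
    return objects

sample = """[HitObjects]
256,192,1000,1,0
320,160,1500,1,0"""
print(parse_hit_objects(sample))  # [(256, 192, 1000), (320, 160, 1500)]
```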

Methodology: What is the architecture of your model?

We will explore various architectures (CNNs, RNNs, LSTMs, Transformers) to see what works best. We think the hardest part of this task is ensuring that the model's output is actually parseable by osu!. Not only must the timing and placement of each hit object be valid, but they must also be consistent with the generated metadata. The .osu beatmap format is strict, and small formatting errors can complicate the process.
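One way to sidestep most formatting errors is to never emit raw text from the model: the model predicts structured values, and a small serializer renders them in the field order the `.osu` format expects. The sketch below (our own function, assuming hit circles only, with `type=1` and default hit sounds) also enforces two validity constraints, on-screen coordinates and ascending timestamps.

```python
def serialize_hit_objects(objects):
    """Render (x, y, time) predictions as [HitObjects] lines.

    Field order follows the .osu format for a plain hit circle:
    x,y,time,type,hitSound,hitSample (type=1 marks a circle).
    """
    lines = ["[HitObjects]"]
    for x, y, t in sorted(objects, key=lambda o: o[2]):  # osu! expects ascending time
        # clamp to the playfield so every predicted object is on screen
        x = min(max(int(x), 0), 512)
        y = min(max(int(y), 0), 384)
        lines.append(f"{x},{y},{int(t)},1,0,0:0:0:0:")
    return "\n".join(lines)

print(serialize_hit_objects([(300, 200, 1500), (256, 192, 1000)]))
```

Constraining the output space this way means the model can only produce maps that are at least syntactically valid, even if they are musically poor.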

Metrics: what constitutes success?

Although we will train the model to prioritize accuracy, our main criterion for success is the playability of the beatmap. This means osu! must be able to parse the output map, and the map must be playable by a player (not too difficult). Furthermore, the notes must more or less accurately represent the music, although this will mostly be judged qualitatively: different people might interpret a song's rhythm differently when translating it into a beatmap, so there is no single objectively correct way to "map" a song.

In case things go awry, we can hardcode the metadata portion of the osu! beatmap and instead produce only a sequence of vectors representing simple hit objects (no sliders, no timing information, etc.).
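For that fallback, one simple vector representation is to quantize note positions onto a coarse grid and one-hot encode the cell index, turning placement into a classification problem. The grid size and function names below are our own illustrative choices, not a scheme from the blog post.

```python
import numpy as np

GRID_W, GRID_H = 16, 12  # quantize the 512x384 playfield into a coarse grid

def encode_note(x, y):
    """One-hot encode a hit circle's position over a 16x12 grid (192 classes)."""
    col = min(x * GRID_W // 512, GRID_W - 1)
    row = min(y * GRID_H // 384, GRID_H - 1)
    vec = np.zeros(GRID_W * GRID_H)
    vec[row * GRID_W + col] = 1.0
    return vec

def decode_note(vec):
    """Map a one-hot (or softmax) vector back to the center of its grid cell."""
    idx = int(np.argmax(vec))
    row, col = divmod(idx, GRID_W)
    return (col * 512 // GRID_W + 512 // (2 * GRID_W),
            row * 384 // GRID_H + 384 // (2 * GRID_H))

print(decode_note(encode_note(256, 192)))  # (272, 208): center of the note's cell
```

Since `decode_note` takes an argmax, it works unchanged on the softmax probabilities a trained model would emit.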

To summarize:

  • Base goal: Make a model that can output a playable osu! map with predetermined metadata (so no need to predict BPM and difficulty names etc.)
  • Target goal: Also make the model predict metadata.
  • Stretch goal: Make beatmap difficulty a parameter, so one could request an "easier" or more "difficult" map. Perhaps also incorporate sliders, spinners, and other game-specific elements.

Ethics

  • Why is Deep Learning a good approach to this problem?: The task of creating a beatmap from a song doesn't have a true objective answer, so making an efficient deterministic algorithm to do so doesn't make sense. However, there still exists a notion of a beatmap "feeling right" and "complementing a song well". Capturing this notion is difficult to do with plain arithmetic and algorithms, so deep learning seems like a decent solution.

  • What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?: The dataset is collected from osu!'s public repository. While osu! does not have an official policy prohibiting training models on its data, a beatmap's creator might not want their work used for training. To this end, it seems important to track which data was used to train the model and to make retraining easy in case a creator wants their beatmap excluded.

Division of labor: We think it would be useful (both as a learning opportunity and to ensure our model is the best it can be) for everyone to contribute at each stage of the model's development. That said, we can still roughly divide the project into data collection, preprocessing, training, and investigating ways to represent the output as a sequence of vectors.
