DeepSaber: Generating Beat Saber levels using Machine Learning

20 Jul 2019

During Hilary and Trinity Term 2019, OxAI Labs piloted its first research project with a team of seven Oxford students from a mix of disciplines. The team's objective was to investigate the development of an Artificial Intelligence (AI) level generator for the Beat Saber Virtual Reality (VR) game.

Beat Saber (released May 2018) is a music-based VR game in which players use hand-held motion controllers to slash through incoming blocks, positioned according to the rhythm and melody of a song. Players must also move to avoid obstacles and explosive blocks over the course of a song. Beat Saber's gameplay might be compared to a cross between Guitar Hero and Fruit Ninja (video here): a rhythm game with saber dynamics. Beat Saber has become immensely popular in the VR community, but on release it provided only a limited number of song levels. This led the modding community to contribute tools for bringing custom songs to Beat Saber (e.g. BeatSaver and others). These community-led projects let players create song levels in an editor, export them, and share them with the world.

The challenge of automatically generating fun and rhythmically coherent Beat Saber levels caught the interest of OxAI Labs as an intersection of machine learning and audio processing, realised in a virtual-reality setting. What's more, a successful investigation would give the devoted Beat Saber community an additional tool for continued enjoyment of the game.

I am happy to share an account of our process, with its winding considerations, challenges and successes. This project would not have been possible without an amazing team: Andrea, Guillermo, JJ, Mackenzie, Michael, Tim and Ralph (myself). The creativity and drive of each member inspired a great collaboration, the fruits of which are shared below.

At the start of our project we began, as could well be expected from a team of AI nerds, by defining level generation in computational terms. For human creators, level creation consists of placing blocks and obstacles at “events” in the song, corresponding to beat drops, melody changes, etc., such that block sequences follow song-specific patterns. Hence, we defined our AI level generator’s task as learning, from human-created examples, to map raw waveforms (the way computers process audio) to Beat Saber levels, such that this mapping approximates human judgement of musical “events” as reliably as possible. To this end, we collected a large dataset from BeatSaver.com, consisting of the 1,000 most downloaded levels (as of February 18th, 2019). We augmented this dataset by shifting the audio in the frequency domain, obtaining 8 times as many training samples.
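A minimal sketch of such an augmentation step, assuming the frequency-domain shift amounts to pitch-shifting each song; the file names and shift amounts below are illustrative rather than the project's exact settings:

```python
# Illustrative sketch: create pitch-shifted copies of a song as extra
# training samples. Shift amounts and file naming are assumptions.
import librosa
import soundfile as sf

def augment_song(path, semitone_shifts=(-2, -1, 1, 2)):
    """Write pitch-shifted copies of a song to use as extra training samples."""
    y, sr = librosa.load(path, sr=None, mono=True)
    for n in semitone_shifts:
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)
        sf.write(f"{path}.shift{n:+d}.wav", shifted, sr)

augment_song("song.ogg")  # hypothetical input file
```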

Using this collected data, we began looking at representations and features for raw audio that can capture these musical events, and looked at related work in speech recognition and audio processing. There, raw audio, itself too bulky to be efficiently analysed, is converted to more compressed features such as Mel spectrograms and chromagrams, and then used for downstream tasks. Both spectrograms and chromagrams relay events such as beat drops and note changes, which are important for level generation. Therefore, we experimented with both features during our work.
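As an illustration, the snippet below computes both features with librosa; the hop length and number of mel bands are illustrative defaults rather than the settings we actually used:

```python
import librosa
import numpy as np

y, sr = librosa.load("song.ogg", sr=None, mono=True)  # hypothetical file

# Mel spectrogram (in dB): energy per mel-frequency band over time.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=512)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Chromagram: energy per pitch class (C, C#, ..., B) over time.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

print(mel_db.shape, chroma.shape)  # (80, n_frames), (12, n_frames)
```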

We then defined representations for Beat Saber levels that can concisely, yet comprehensively, capture a wide range of block timings and block combinations. In theory, blocks/obstacles can appear at any real-valued time, which makes generation highly intractable, so we discretised time to intervals small enough not to be noticeable by humans. We came to realise that even with just blocks and bombs, there are 20 possibilities for a single cell which, with Beat Saber’s 4×3 grid, yields 20^12 (≈ 4.1 × 10^15) theoretically possible states at any given time! In our earlier models, we avoided this exponential blow-up by introducing independence between cells. Later, we modelled the full block configuration space such that any configuration can still theoretically be returned. However, we also introduced a new state representation using only the top 2000 configurations of blocks in the training set, which accounted for 99.53% of all configurations. This dramatically simplified the output space, and provided the model with a strong inductive bias towards human levels, albeit at the expense of not producing original new configurations.
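A minimal sketch of how such a reduced vocabulary could be built; the cell encoding and function names are illustrative assumptions, not the project's actual code:

```python
from collections import Counter

def build_config_vocab(levels, top_k=2000):
    """levels: iterable of levels, each a sequence of grid configurations,
    where a configuration is a tuple of 12 cell states (one per cell of the
    4x3 grid, e.g. 0 = empty, ..., 19 = bomb)."""
    counts = Counter(cfg for level in levels for cfg in level)
    vocab = [cfg for cfg, _ in counts.most_common(top_k)]
    cfg_to_id = {cfg: i for i, cfg in enumerate(vocab)}
    coverage = sum(counts[cfg] for cfg in vocab) / sum(counts.values())
    return cfg_to_id, coverage  # coverage was ~99.5% on our training set

def encode(cfg, cfg_to_id, unk_id=None):
    """Map a configuration to its vocabulary index, with a fallback for
    configurations outside the top-k vocabulary."""
    return cfg_to_id.get(cfg, unk_id)
```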

We began implementing models over the course of several weekly meetings and weekend work sessions. We started with a simple rule-based system that would create blocks based on the dominant chroma, and built on that by restricting the size of the state space and adding some stochastic decisions, with unit-testing checks to ensure level compliance. We built our first machine learning model towards the end of March: an adapted non-causal WaveNet model trained using maximum likelihood. This model used a time discretisation relative to the tempo (Beats Per Minute) of the input song (1/16th of a beat), and performed in a manner that was encouraging, though not entirely convincing. We tried using Generative Adversarial Networks (GANs) with this model to incorporate a sense of “authenticity” into the training process, but did not observe any dramatic improvements, especially given the additional computational cost of the change.
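To give a flavour of the rule-based baseline, here is a minimal sketch that places one block per detected beat, with its grid position chosen from the dominant chroma at that moment; the chroma-to-position mapping and the simplified note fields are illustrative assumptions rather than our actual rules:

```python
import librosa
import numpy as np

def rule_based_level(path):
    """Place one block per detected beat, positioned by the dominant chroma."""
    y, sr = librosa.load(path, sr=None, mono=True)
    _, beat_times = librosa.beat.beat_track(y=y, sr=sr, units="time")
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    frame_times = librosa.times_like(chroma, sr=sr)

    notes = []
    for t in beat_times:
        frame = int(np.argmin(np.abs(frame_times - t)))
        pitch_class = int(np.argmax(chroma[:, frame]))  # dominant chroma (0-11)
        notes.append({
            "time": float(t),                  # seconds; a real level stores beats
            "column": pitch_class % 4,         # grid column (0-3)
            "row": (pitch_class // 4) % 3,     # grid row (0-2)
            "color": 0,                        # e.g. red saber
            "direction": pitch_class % 9,      # one of the 9 cut directions
        })
    return notes
```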

Towards the start of Trinity, inspired by similar work for the Dance Dance Revolution game (Dance Dance Convolution (DDC)), we developed a new two-stage model for level generation that divides the task into two parts: block placement (when to do something) and block selection (what to do). The first component returns a binary “place something” flag at every discretised time step, and the second component chooses a block configuration for the flagged times, given the audio context.
We adopted and adapted the DDC model by using our earlier WaveNet model for the first stage, supplementing it with a peak-selection algorithm to avoid consecutive “place something” cues, and a Transformer for the second stage. We found that, given the modularity of this model, we were also able to use the first stage of the original DDC with our second-stage transformer to perform additional experiments without much additional labour.
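As an illustration of the glue between the two stages, here is a minimal sketch of a peak-selection step, assuming the first stage outputs a placement probability per time step; the threshold and minimum-gap values are illustrative:

```python
import numpy as np

def pick_placements(probs, threshold=0.5, min_gap=2):
    """probs: placement probability per discretised time step.
    Returns indices of steps at which the second stage should choose a
    block configuration, keeping only local maxima above the threshold
    and enforcing a minimum gap between consecutive placements."""
    picked = []
    last = -min_gap
    for i in range(1, len(probs) - 1):
        is_peak = probs[i] >= probs[i - 1] and probs[i] >= probs[i + 1]
        if is_peak and probs[i] >= threshold and i - last >= min_gap:
            picked.append(i)
            last = i
    return np.array(picked, dtype=int)
```

Lowering the threshold (or the minimum gap) yields denser levels, which is one simple way of exposing a difficulty control at generation time.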

Nearing the culmination of the project, we trained our end-to-end WaveNet and two-stage models on two cloud instances with V100 GPUs. These resources were made available through a computational resources grant from the Google Cloud Platform. We experimented with a number of setups, especially for the first stage of the two-stage model. Ultimately, our (subjectively judged) best-performing model was the two-stage model using the Dance Dance Convolution architecture for its first stage. This model produced realistic levels with mostly reasonable timings and block patterns. It could also produce levels of variable difficulty, simply by varying the sensitivity of the block-placement stage of the algorithm.

On May 29th, OxAI Labs hosted its first demo day at Trinity College, Oxford, to present the results of the Beat Saber project. The demonstration consisted of a 40-minute presentation followed by a demo on a VR headset, giving attendees the chance to try out AI-generated levels. We also presented the Beat Saber project at the Oxford Immersive Technologies event at the Mathematical Institute on the 21st of June, and were grateful to receive overwhelmingly positive feedback. We aim to make our models, data and code open-source before the end of August 2019 and, following that, to develop an interface through which Beat Saber players can easily use our tool. Some of the project members have expressed an interest in pursuing further research into incorporating difficulty and playability analysis into the generation process, using metadata available online about human-created levels.

The completion of the Beat Saber project represents OxAI Labs’ first major venture into community-led research. The project was intended to encourage collaboration and learning, and I am very happy to confirm the value that each of us has taken away from its pursuit. We aim to build on the success of Beat Saber and will soon be announcing new projects for Michaelmas 2019. We look forward to hosting new community collaborations soon and are very excited for the developments that lie ahead.

Make sure to Subscribe to the OxAI Labs newsletter if you are interested in participating in a future OxAI Labs research project.

DeepSaber team

Written by
Ralph Abboud
ralph.abboud@oxai.org