SampleRNN

Posted by Piotr Kozakowski & Bartosz Michalak on Sun 19 February 2017

One WaveNet issue

First of all, a quick recap - some time ago Piotr was checking out WaveNet, DeepMind's model for signal generation, which is already well known in the field. While the original blog post suggested that WaveNet generates quite promising results for music generation, we were unable to reproduce those results, probably because of our limited computational resources (see our WaveNet post for details). Apart from the quality of the results, we also pointed out one major concern about WaveNet's architecture overall - it's just really complicated. Not only was it tough to understand what's going on, but once we had figured out how it works, it became clear that any modification to the model would be a lot of work. In fact, before we could even think about developing the model further, we would need to at least match DeepMind's quality of results to have a sensible baseline. That would require extensive hyperparameter tuning - and because WaveNet is really computationally heavy, that would mean a lot of time.

Nevertheless, WaveNet was a great breakthrough - when we began investigating it, there was no other model with such good results in sample-by-sample audio generation.

Still, there are a lot of issues concerning WaveNet.

And as it turns out, it was one of WaveNet's issues that led us to a much more promising model.

SampleRNN

SampleRNN is "An Unconditional End-to-End Neural Audio Generation Model", invented by Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Manuel Rodriguez Sotelo, Aaron Courville and Yoshua Bengio. As the description says, this model, just as WaveNet, is an end-to-end model for generating audio sample-by-sample.

Unlike WaveNet, which is based on convolutional layers, SampleRNN is based on recurrent layers. According to the authors, GRU units work best, but any type of RNN unit can be used. The layers are grouped into "tiers" which form a hierarchical structure: in every tier, the output of a single timestep conditions several timesteps of the tier below through a learned upsampling. Different tiers therefore operate at different clock rates, which means they can learn different levels of abstraction. For example, in the lowest tier one timestep corresponds to one sample, while in the highest tier one timestep might correspond to a quarter of a second, which may cover one or even several notes. In this way the model can go from something like a sequence of notes down to a sequence of raw samples. Furthermore, every timestep in every tier is also conditioned on the samples generated in the previous timestep of the same tier. The lowest tier is not recurrent: it is an autoregressive multi-layer perceptron conditioned on the last few samples and on the output of the tier above.
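To make the hierarchy more concrete, here is a minimal PyTorch sketch of the two-tier idea: a frame-level GRU conditions a sample-level MLP through a learned upsampling. This is our own simplification for illustration, not the authors' code; the class, sizes and variable names are assumptions.

```python
# Minimal two-tier sketch (our simplification, not the authors' Theano code).
import torch
import torch.nn as nn

FRAME_SIZE = 16   # samples per frame-tier timestep
Q_LEVELS = 256    # quantization levels
DIM = 1024        # hidden dimension

class TwoTierSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Frame tier: one GRU step per FRAME_SIZE samples.
        self.frame_rnn = nn.GRU(FRAME_SIZE, DIM, batch_first=True)
        # Learned upsampling: one conditioning vector per sample in the frame.
        self.upsample = nn.Linear(DIM, DIM * FRAME_SIZE)
        # Sample tier: an autoregressive MLP over the last FRAME_SIZE samples.
        self.embed = nn.Embedding(Q_LEVELS, DIM // FRAME_SIZE)
        self.mlp = nn.Sequential(
            nn.Linear(DIM + DIM, DIM), nn.ReLU(),
            nn.Linear(DIM, Q_LEVELS),
        )

    def forward(self, prev_frame, prev_samples, h=None):
        # prev_frame: (batch, 1, FRAME_SIZE) real-valued previous frame
        # prev_samples: (batch, FRAME_SIZE) quantized samples preceding the
        #               sample we want to predict
        frame_out, h = self.frame_rnn(prev_frame, h)           # (batch, 1, DIM)
        cond = self.upsample(frame_out).view(-1, FRAME_SIZE, DIM)
        # Predict the first sample of the new frame from cond[:, 0, :];
        # generation would loop over all FRAME_SIZE positions.
        sample_feat = self.embed(prev_samples).flatten(1)       # (batch, DIM)
        logits = self.mlp(torch.cat([sample_feat, cond[:, 0, :]], dim=1))
        return logits, h  # softmax over Q_LEVELS gives the next sample
```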

Let’s look at an example (image from the original paper):

Here we have 3 tiers. The lowest one is an MLP which takes as input the last 4 samples and the upsampled output of the middle tier. The middle tier conditions 4 timesteps of the lowest tier and takes as input the last 4 generated samples and the upsampled output of the highest tier. The highest tier conditions 4 timesteps of the middle tier and takes as input the last 16 generated samples, which is exactly the number of samples generated during the previous timestep of this tier.

As we can see, this model is quite easy to understand: it consists of well-known recurrent layers separated by upsampling layers, which are ordinary linear transforms, and a multi-layer perceptron. It's also computationally cheap, because only the lowest tier operates at the sample level; the higher ones run at slower clock rates, so they contribute less to the overall computation time. In contrast, WaveNet needs to compute the output of every layer for every sample.
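A rough back-of-the-envelope calculation shows why the higher tiers are almost free. The numbers below use the 3-tier example above and one second of 16 kHz audio; the WaveNet depth of 30 layers is only an assumption for illustration.

```python
# Rough cost comparison for one second of 16 kHz audio (illustrative only).
SAMPLE_RATE = 16000

# SampleRNN: each tier fires once per its own frame of samples.
tier_frame_sizes = {"sample-level MLP": 1, "middle tier": 4, "top tier": 16}
for name, frame in tier_frame_sizes.items():
    print(f"{name}: {SAMPLE_RATE // frame} timesteps per second")

# WaveNet: every one of its layers is evaluated for every single sample.
wavenet_layers = 30  # assumed depth, just for comparison
print(f"WaveNet: {wavenet_layers * SAMPLE_RATE} layer evaluations per second")
```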

Details of the experiment

Fortunately, the authors of SampleRNN published their code, so we could use it for our experiments. We used mostly the default parameters for the 2-tier model, which, according to the authors, worked best for music:

  • 128 frames in a BPTT pass (64 was default)
  • frame size 16 samples (this is the number of samples conditioned by each timestep in the higher tier)
  • embedding size 256
  • no skip connections
  • dimension of RNNs and MLP 1024
  • 3 RNN layers in the higher tier
  • 256 quantization levels
  • linear quantization (a minimal sketch of what this means follows the list)
  • batch size 64 (128 was default)
  • weight normalization
  • learning initial RNN state
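"Linear quantization into 256 levels" simply means mapping each raw sample from [-1, 1] onto one of 256 equally spaced integer bins, which the network then predicts with a softmax. The helper below is our own illustration of that idea, not code taken from the SampleRNN repository.

```python
# Linear quantization of audio samples into Q_LEVELS equally spaced bins.
import numpy as np

Q_LEVELS = 256

def linear_quantize(samples: np.ndarray) -> np.ndarray:
    """Map float samples in [-1, 1] to integers in [0, Q_LEVELS - 1]."""
    samples = np.clip(samples, -1.0, 1.0)
    return np.minimum(((samples + 1.0) / 2.0 * Q_LEVELS).astype(np.int64),
                      Q_LEVELS - 1)

def linear_dequantize(quantized: np.ndarray) -> np.ndarray:
    """Map integer bins back to the centre of each bin in [-1, 1]."""
    return (quantized.astype(np.float64) + 0.5) / Q_LEVELS * 2.0 - 1.0

# Example: a sine wave survives the round trip with at most one bin of error.
x = np.sin(np.linspace(0, 2 * np.pi, 100))
assert np.max(np.abs(x - linear_dequantize(linear_quantize(x)))) < 2.0 / Q_LEVELS
```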

We used the same datasets as in our WaveNet experiment: piano, guitar, classical music and ambient.

Finally, some actual music

If you follow our facebook page (which we wholeheartedly encourage you to do), you may have already heard some music pieces from our favourite composer, Cervello Finto. As is often the case in the art world, Cervello Finto is an artistic pseudonym - this time not for an actual artist, but for the neural network we described above (who said pseudonyms are only for people?!).

But that's not all! With SampleRNN, we are happy to provide more entertaining outputs:

piano:

guitar:

classical:

ambient:

Well, the results certainly aren’t perfect, but they are definitely miles ahead of both our WaveNet results and what we retrieved from the generated spectrograms. The guitar and ambient samples don’t sound so good yet, but the piano and classical samples are very promising. As for generation time, SampleRNN vastly outperforms WaveNet (we’ll stick to comparing these two models, since the GAN approach is quite different).

As you can hear, one surprising outcome is the quality of the classical music sample compared to the guitar one. Intuitively, classical music should be harder to learn, since it includes several instruments with different playing styles, while guitar is, well… just guitar. It is truly amazing (and in fact quite amusing) to hear that the network learned to play the clash cymbals (and one could say it truly has a knack for it), along with some clearly audible wind instruments - and maybe even some bowed string instruments.

Is this the way to go?

As we could see (or hear, actually), SampleRNN seems to beat WaveNet in music generation on many levels - let us only mention better results with less effort, faster generation and - what's very important - a simple idea and simple code. Naturally, Piotr has shifted all his attention towards SampleRNN, leaving WaveNet behind, at least for a good while.

What's more, SampleRNN may be presented at ICLR! We hope to hear some more details or hints from the authors, along with their future plans for the model!

tags: RNN, SampleRNN

