Posted by Piotr Kozakowski & Bartosz Michalak on Mon 12 December 2016

What's up

It's been a while since we last posted - partly due to some personal stuff (studying actually takes time) and partly due to the nature of the problem we're trying to solve. Music generation is a rather underdeveloped field - there are no established "right ways" to do it yet, and the variety of projects that have emerged just recently rather proves this point.

We needed to decide how we wanted to tackle this problem - what to start with? As we already mentioned, we're mostly interested in a kind of end-to-end solution - generating signal from signal (with as little preprocessing as possible), using neural networks. One obvious thing to check out was one of the hottest Deep Learning papers published recently - Google's WaveNet. Piotr decided to investigate the capabilities of this model.


So, what exactly is this Google WaveNet? The official DeepMind blog post describes it as a deep generative model of raw audio waveforms. Originally it was developed for speech synthesis, where it is said to improve on the current state of the art. But how does it work?

WaveNet is actually a Convolutional Neural Network which takes raw signal as input and synthesises the output sample by sample. The layers used are atrous (or dilated) convolutional layers - a kind of convolutional layer in which each filter takes every n-th element of the input, rather than a contiguous part. The layers are grouped into several “dilation stacks”. In each stack, each layer has twice the dilation of the previous one, so the i-th layer has dilation 2^i. The more layers we have, the bigger the receptive field of the network - the amount of input it actually looks at when generating the next sample. This picture, taken from the mentioned blog post, nicely visualizes the general idea:
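To make the “every n-th element” idea concrete, here is a minimal pure-Python sketch of a causal dilated convolution. This is not WaveNet’s actual code - the function name, taps and signal are made up for illustration - but it shows how the same two filter taps reach further back in time as the dilation grows:

```python
# A toy 1-D causal dilated convolution: each filter tap looks at every
# `dilation`-th past sample instead of a contiguous window.
def dilated_causal_conv(signal, taps, dilation):
    """Convolve `signal` with `taps`, spacing the taps `dilation` apart.

    out[t] depends only on signal[t], signal[t - dilation],
    signal[t - 2*dilation], ... -- never on future samples (causality).
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(taps):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start are treated as zero
                acc += w * signal[idx]
        out.append(acc)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# dilation 1: an ordinary causal convolution over neighbouring samples
print(dilated_causal_conv(signal, [0.5, 0.5], 1))
# dilation 4: the same two taps, but mixing x[t] with x[t - 4]
print(dilated_causal_conv(signal, [0.5, 0.5], 4))
```

Stacking such layers with doubling dilations is what lets the receptive field grow exponentially with depth while each layer keeps the same small number of parameters.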

Receptive field of the WaveNet

There are several dilation stacks on top of each other in the network, so the total dilation sequence looks like 1, 2, 4, …, 2^n, 1, 2, 4, …, 2^n, …. Each dilation stack can be thought of as a single complex, non-linear convolutional filter with a really big receptive field but not that many parameters. The details can be found in both the blog post mentioned above and the paper itself, which contains much more information on the architectural and implementation details. We found the paper pretty hard to understand at first, as it uses quite a few sophisticated Deep Learning techniques. These include skip connections and gated activation units.

Skip connections are a technique used in Residual Neural Networks. The idea is for each layer to have two outputs. The first goes directly to the next layer, as in regular neural networks. The second, the “skip connection”, bypasses the consecutive layers. The skip connections from all layers are added together to form the final output of the network. This approach is said to “speed up convergence and enable training of much deeper models” (quoting the original paper). That’s because the gradient for the early layers doesn’t need to pass through all the consecutive layers (which could cause the “vanishing gradient” problem), but can flow straight from the output of the network through a skip connection. This way, theoretically, adding more layers cannot worsen the performance of the network, but can certainly improve it.
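A toy numeric sketch of the idea (the real layers are convolutions; here each “layer” is collapsed to scalar arithmetic, and the weights are invented purely for illustration):

```python
# Each layer emits one value to the next layer (the residual path) and one
# "skip" value that bypasses all later layers and is summed into the output.
def layer(x, residual_weight, skip_weight):
    h = max(0.0, x)                      # some nonlinearity (ReLU here)
    residual = x + residual_weight * h   # goes on to the next layer
    skip = skip_weight * h               # goes straight to the output
    return residual, skip

def forward(x, weights):
    skips = []
    for rw, sw in weights:
        x, s = layer(x, rw, sw)
        skips.append(s)
    return sum(skips)  # network output = sum of all skip outputs

print(forward(1.0, [(0.1, 1.0), (0.1, 1.0), (0.1, 1.0)]))
```

Note how the output depends on every layer through a path of length one - that short path is what keeps the gradient from vanishing in deep stacks.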

The gated activation units are actually a combination of tanh and sigmoid activations: two convolutional filters, each with its own activation, are multiplied element-wise to produce the final activation. The sigmoid-activated filter acts like a “gate” for the tanh-activated filter - it basically says how “important” the output of the tanh filter is.
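In scalar form the unit can be sketched like this (in WaveNet both paths are convolutions; here they are reduced to single made-up weights so the gating effect is easy to see):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# z = tanh(w_filter * x) * sigmoid(w_gate * x)
def gated_activation(x, w_filter, w_gate):
    filter_out = math.tanh(w_filter * x)  # the "content" path, in (-1, 1)
    gate_out = sigmoid(w_gate * x)        # the "gate", in (0, 1)
    return filter_out * gate_out

# A strongly positive gate pre-activation lets the tanh output through;
# a strongly negative one squashes the output toward zero:
print(gated_activation(1.0, 1.0, 10.0))   # gate ~ 1: content passes
print(gated_activation(1.0, 1.0, -10.0))  # gate ~ 0: content suppressed
```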

Sounds strange? The paper gives a general idea of how it works, but what helped us most in understanding these ideas was this implementation. And even though we were actually able to run this code, something as 'simple' as running a neural network can sometimes be one hell of a struggle...

On choosing the implementation and GPUs getting old

There are currently two publicly available WaveNet implementations that we are aware of:

  • https://github.com/basveeling/wavenet. This one is quite slow at both learning and generating, so we gave up on using it, but the code is very clear (written using Keras, a high-level neural network library for Python), so reading it helped us a lot in understanding the WaveNet model.
  • https://github.com/ibab/tensorflow-wavenet. This one is written in TensorFlow (a somewhat lower-level neural network library for Python), but is much faster, so this is the one we finally decided to use.

TensorFlow is a great library for deep learning research, but it has one drawback: it requires a relatively recent graphics card for GPU acceleration. Every Nvidia GPU is described by a parameter called Compute Capability - basically a number designating the card's architecture in terms of CUDA features.

For our research we use Nvidia Tesla C2070 GPUs which we got access to through our University. Unfortunately, these have Compute Capability 2.0, and TensorFlow needs at least 3.0. So we couldn’t use our University-issued graphics cards for our WaveNet experiments and ended up using the GPU in Piotr’s laptop. Hence it took a long time to get any meaningful results - we could only train one model at a time and only overnight (apparently a laptop tends to be quite useful in a Computer Science student’s everyday life).

If anyone managed to run TensorFlow on an Nvidia graphics card with Compute Capability 2.0, please let us know in the comments!

The effects and conclusions

For our experiments we used the WaveNet implementation from the tensorflow-wavenet repository mentioned above, with a small modification of ours, available in this pull request. We used the following parameters:

  • sample rate 16000
  • 5 dilation stacks with max dilation 1024
  • 64 residual and dilation channels
  • 1024 skip channels
  • batch size 10000
  • no silence trimming
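A quick back-of-the-envelope check of what these parameters mean for the receptive field. This assumes a filter width of 2 and the common formula `(filter_width - 1) * sum(dilations) + 1`; the exact bookkeeping in tensorflow-wavenet may differ by a sample or two:

```python
# Receptive-field estimate for the parameters listed above.
sample_rate = 16000
filter_width = 2          # assumed; typical for WaveNet-style models
stacks = 5
max_dilation = 1024

# dilations within one stack: 1, 2, 4, ..., 1024 (11 layers)
one_stack = [2 ** i for i in range(max_dilation.bit_length())]
dilations = one_stack * stacks

receptive_field = (filter_width - 1) * sum(dilations) + 1
print(receptive_field)                # in samples
print(receptive_field / sample_rate)  # in seconds: roughly 0.64 s
```

So five stacks up to dilation 1024 cover on the order of ten thousand samples - well under a second of audio at 16 kHz, which becomes important in the conclusions below.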

After training the network on four mixes we found on YouTube (piano, guitar, classical music and ambient), we are glad to present the masterpieces it generated:

trained on piano:

trained on guitar:

trained on classical music:

trained on ambient:

As you all might have just heard, these outputs are far from what anyone would call a masterpiece. The only model which sounds like it actually learned something we were hoping for is the piano one.

A few major overall conclusions about the results:

  • Piano is the only instrument that is recognizable in the generated output, which suggests it may be the instrument that is easiest to model with architectures like this one
  • The strong bursts audible in the guitar output may reflect the nature of playing chords, or hitting the strings in general. Then again, we may have just made that up to justify the model. What do you think?
  • It seems very difficult, if possible at all, for this model to learn more than one instrument at a time, or even any music with a slightly too complicated structure, which we can clearly hear in the ambient and classical music outputs. The only point of disagreement among us while discussing them was whether they resemble seashore sounds or the wind blowing through a forest.

And about the model itself:

The main concern arises directly from the architecture - remember the atrous convolutions and the ‘receptive field’ thing? The receptive field at time t is limited to length_of_receptive_field samples in the past. The parameters we used result in a receptive field of around 0.63 s, while the WaveNet paper mentions using receptive fields of several seconds for its music experiments. That’s probably the reason why our generated samples are so poor compared to DeepMind’s results shown in their blog post. Unfortunately, such big receptive fields are out of our reach due to our computational limitations.

But we think that no matter how big a receptive field we use, the model itself will still be unable to generate a complete piece of music, as most songs hold general ideas - such as rhythm, melody and some higher-level structure - over the whole piece. Local and global conditioning (as described in the original paper) do not seem able to properly address this issue either, as they generally need to be fed in during training, while our aim is to lead the model to learn the higher-level song structure by itself. But is the end-to-end solution we were actually hoping for even possible? Clearly not using just this particular model.

One of our ideas to somehow enable the network to learn the mentioned abstraction layer needed for music generation was to add a different neural network, which would learn to “understand” high-level structure in music and generate some sort of “intermediate representation”, which would then be fed into WaveNet via local conditioning to generate the raw signal.

And here comes the other concern about this model: it just seems too complex. Training it wasn’t impossible, but it surely wasn’t straightforward. Due to this complexity, modifying the architecture and tweaking the hyperparameters is a pretty challenging job to do properly (not to mention the time needed to retrain such a model and then generate some samples).

We are not giving up on WaveNet for sure - but for now it needs to rest for a while, while we research other models, possibly better suited to the task we are tackling.

Stay tuned

As we mentioned before, Piotr was the one actually investigating WaveNet - during this time, Bartek was trying to figure out how we could use another generative model - the Generative Adversarial Network - for music generation. The results are there, but we still hope to improve them further, so look out for more updates!

tags: CNN, WaveNet, Google
