DCGAN and spectrograms

Posted by Piotr Kozakowski & Bartosz Michalak on Mon 16 January 2017

Another approach

While Piotr investigated the capabilities of WaveNet in terms of generating music, Bartek tried another hot topic - Generative Adversarial Networks, specifically the DCGAN architecture. As our project's purpose is purely to create a neural network with the ability to create music, it is quite obvious that any new (and fancy) generative model is of interest to us, and surely one of the most popular is DCGAN - Deep Convolutional Generative Adversarial Networks.

Dear reader, did you notice the Deep Convolutional prefix? As you may have heard or already noticed, DCGAN achieved great results with generating images. Images. And how do you represent sound as an image? Well, if you read our amazing post about musical genre recognition, the first thing that pops into your mind should be the term spectrogram. At least it was for us.

Generating spectrograms - how and why

The idea of generating spectrograms arose mainly from our previous experiments with them - we actually know how they represent the signal and they are quite easy to interpret. Another quite important feature of spectrograms in terms of generating music is the existence of the inverse short-time Fourier transform (ISTFT), which lets us recover the signal back from the spectrogram matrix.

Wait a second dear sir! Spectrograms contain only magnitude of the signal, but this way we are going to forget about the phase!

*Sigh* You almost got me! The last but not least feature of a musical signal that we wish to emphasize is the fact that most of the information is held in the magnitude, not in the phase. If you do not believe me, take a look at this quick example:

A sample song from the GTZAN dataset, which happens to resemble a live version of Bohemian Rhapsody by Queen:

The very same song quickly reconstructed from the real part of the STFT matrix (imaginary part set to 0):

And lastly, reconstructed from the magnitude alone (random phase drawn from a uniform distribution over [-pi, pi]):

As you can hear, even though the music is heavily affected by the lack of proper phase, we can still both recognize it and treat the result as music - maybe just a weird space-robots kind, but surely still music. Of course, even though that alone would actually be something, we aim at imitating somewhat more human compositions. But let's say we are able to generate the spectrograms - then the task of optimizing the phase given a magnitude does not seem impossible. Honestly, we are not aware whether it can be done purely algorithmically (due to some properties of the musical signal we don't know about), but even if it can't - we could probably just train another neural network to solve this problem. As the phase is not that important in general, this still looks like an achievable goal.
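To make the random-phase trick concrete, here is a minimal numpy sketch of the idea on a single frame (our actual reconstruction worked on the full STFT matrix; the toy signal and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A single 256-sample frame of a toy signal: two sine partials at 440 and 880 Hz.
n = 256
t = np.arange(n) / 16000.0
frame = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# Forward transform; keep only the magnitude, as a spectrogram does.
magnitude = np.abs(np.fft.rfft(frame))

# Rebuild the frame with a random phase, uniform over [-pi, pi).
phase = rng.uniform(-np.pi, np.pi, magnitude.shape)
phase[0] = phase[-1] = 0.0  # DC and Nyquist bins must stay real
reconstructed = np.fft.irfft(magnitude * np.exp(1j * phase), n=n)

# The waveform is completely different, but its magnitude spectrum is intact -
# which is why the audio is still recognizable, just "space-robots"-flavoured.
print(np.allclose(np.abs(np.fft.rfft(reconstructed)), magnitude))  # True
```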


One of the most interesting ideas of recent years in deep-learning-based generative models is Generative Adversarial Networks. Each model is actually a combination of two neural networks - a Discriminator (D) network and a Generator (G) network. The Discriminator tries to guess whether its input came from the data distribution (is real) or from the generated data distribution (is a fake). The Generator is given some random vector ("code") from the latent space and tries to generate data based on it that resembles the real data as much as possible. How does the Generator know whether the output it produced is more or less similar to the real ones? From the Discriminator, of course!

The training process is actually a game: the Discriminator is given real data (with label 1) and fake data (with label 0) and learns to distinguish between them. The Generator is given the output of the Discriminator (its "opinion" on how similar the generated data is to the original distribution) and learns how to fool the Discriminator, trying to generate data which the Discriminator would classify as real (label 1). This game theoretically leads to the Generator being able to generate examples from the original data distribution based on random noise. Details of the GAN architecture can be found in this paper, section 3.
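In loss terms, the game can be sketched with the standard GAN cross-entropy objectives (a toy numeric illustration, not our actual training code):

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator wants D(x) -> 1 on real data and D(G(z)) -> 0 on fakes."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Generator wants the Discriminator to output 1 on its fakes."""
    return -math.log(d_fake)

# A confident, correct Discriminator: low D loss, high G loss.
print(round(d_loss(0.9, 0.1), 4))  # 0.2107
print(round(g_loss(0.1), 4))       # 2.3026

# A fooled Discriminator: the Generator's loss drops.
print(round(g_loss(0.9), 4))       # 0.1054
```

The Generator's gradient comes entirely through `d_fake`, which is why it can only improve as long as the Discriminator gives it a useful signal.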

The DCGAN combined the idea of GANs with convolutional networks, which are known to be great at solving computer vision tasks, and produced a generator capable of generating quite realistic images with interesting properties (a good description of which can be found in the readme of the official repo).

The quickest way to get to know the actual DCGAN architecture is to look at the code (lines 63-101). As we can see, the Generator is more or less the Discriminator backwards: the Generator takes as input an nz-dimensional code vector, which it upsamples to 8 * ngf (a model hyperparameter, meaning the final number of feature maps in the Generator) feature maps, each of size 4x4. It then performs Batch Normalization and repeatedly upsamples (using deconvolution, which in neural network terminology is just a transposed convolution, sometimes called upconvolution) 2 times in both width and height, while reducing the number of feature maps by a factor of 2. The "state size" comments pretty much illustrate the dimensions after each layer.
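The upsampling arithmetic is easy to check by hand: for a transposed convolution, out = (in - 1) * stride - 2 * pad + kernel along each dimension. A quick sketch (our own helper, not part of the original code):

```python
def deconv_out(size, kernel, stride=1, pad=0):
    """Output size of a transposed ("full") convolution along one dimension."""
    return (size - 1) * stride - 2 * pad + kernel

# DCGAN generator: project the 1x1 code to 4x4 feature maps, then keep
# doubling the spatial size with kernel 4, stride 2, padding 1.
size = deconv_out(1, kernel=4)  # 1x1 -> 4x4
sizes = [size]
for _ in range(4):
    size = deconv_out(size, kernel=4, stride=2, pad=1)
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```

One more such layer would reach 128, which is exactly the extra layer we add in the experiments below.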

Bedrooms generated by the DCGAN architecture (taken from the official repository):

bedrooms generated by DCGAN

On choosing the implementation and GPUs getting old

The methodology of work in our project is pretty straightforward - since we have only two years to complete our Master's thesis, the first year is mostly about running existing working code to get some actual results and reduce the amount of time we would otherwise spend implementing the same thing on our own. This time, we chose this implementation of DCGAN, written in Torch.

Torch is quite a nice framework, but we experienced a lot of trouble installing it on a server without root access and still tend to stumble upon some inexplicable luarocks troubles (the most recent being problems installing the rnn package for Torch). Honestly, virtualenv and pip seem a lot easier to maintain, work out of the box most of the time, and Python is a much more popular language. What do you think are the pros of using Lua and Torch over Python and some Python frameworks?

The only reason for choosing Torch was that it actually worked on the server we use, because of the issues with Theano's Batch Normalization and the lack of TensorFlow overall. Both of them use cuDNN, which cannot run on our GPUs for the reasons we already described. And we had already had Torch installed because of some minor past projects. After removing the Batch Normalization, the network seemed not to learn anything at all - Batch Normalization was also mentioned as quite important in the original paper.

We honestly hope we will be able to skip this hardware-pain subsection in our future posts.

The effects and conclusions

During the training of the DCGAN we used a few different configurations, each resulting in a slightly different output of the network.

Dataset info:

  • 2890 spectrograms
  • 128x128 px
  • representing a 5s part of our piano dataset
  • sampling rate: 16000Hz
  • STFT parameters: dft_size: 256, hop_length: 128
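As a sanity check on these parameters: a dft_size of 256 gives 129 frequency bins, and for a standard non-padded STFT the raw frame count of a 5 s clip at 16000 Hz comes out well above 128, so fitting it into a 128x128 image involves some cropping or scaling. A small sketch of the arithmetic (the helper is ours, not from our pipeline):

```python
def stft_shape(samples, dft_size, hop_length):
    """(frequency bins, time frames) of a standard non-padded STFT."""
    bins = dft_size // 2 + 1
    frames = 1 + (samples - dft_size) // hop_length
    return bins, frames

sr = 16000
print(stft_shape(5 * sr, dft_size=256, hop_length=128))  # (129, 624)
```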

The DCGAN architecture modifications: in all experiments, we removed code for random cropping and flipping the images, as that is not something we feel spectrograms should be preprocessed with.


Some selected results:

1. The first attempt was quite different from the others: pure DCGAN architecture, using colored versions of 512x512 px spectrograms, cropped by the DCGAN to 64x64 px. After 100 iterations:

First architecture after 100 iterations.

After 500 iterations:

First architecture after 500 iterations.

The results were quite promising: unfortunately, our experiments showed that a resolution of 64x64 px is too low to retrieve any meaningful signal from the spectrogram. We needed to increase the resolution of pictures consumed by the model. Another thing we wanted to get rid of was the unnecessary color channels - grayscale is just as good at representing spectrograms and music can be retrieved from such pictures easily.

2. DCGAN architecture - grayscale, 128 px. Adding another layer to both the Generator and the Discriminator, maintaining the whole architecture concept (doubling the feature maps etc.). Experiments showed that this way the greatest resolution achievable is 128x128 px, due to memory limitations on the GPUs.

After 406 iterations:

Second architecture after 406 iterations.

This architecture seemed not to learn anything better than that.

3. Own architecture - grayscale, 128 px, 1D convolution, expanding to 128x128 px in the Generator's last deconvolution layer. The number of feature maps does not scale by a factor of 2, but was set so that the highest number of feature maps is still the same as in the other experiments.

The code of Generator:

netG = nn.Sequential()
-- input is Z, going into a convolution
netG:add(SpatialFullConvolution(nz, ngf * 4, 4, 1))
netG:add(SpatialBatchNormalization(ngf * 4)):add(nn.ReLU(true))
-- state size: (ngf*4) x 4 x 1 -- width x height
netG:add(SpatialFullConvolution(ngf * 4, ngf * 2, 4, 1, 2, 1, 1, 0))
netG:add(SpatialBatchNormalization(ngf * 2)):add(nn.ReLU(true))
-- state size: (ngf*2) x 8 x 1
netG:add(SpatialFullConvolution(ngf * 2, ngf, 4, 1, 2, 1, 1, 0))
-- state size: (ngf) x 16 x 1
netG:add(SpatialFullConvolution(ngf, ngf, 4, 1, 2, 1, 1, 0))
-- state size: (ngf) x 32 x 1
netG:add(SpatialFullConvolution(ngf, ngf, 4, 1, 2, 1, 1, 0))
-- state size: (ngf) x 64 x 1
netG:add(SpatialFullConvolution(ngf, nc, 4, 128, 2, 1, 1, 0))
-- state size: (nc) x 128 x 128

The training batch sizes were 64 and 256, with 64 outperforming 256 in terms of results. The model with batchSize = 64 was trained in two configurations: with a 196-dimensional latent space (nz=196) and a 512-dimensional latent space.

Dim = 196, after 137 iterations:

Third architecture after 137 iterations with 196 dimensional latent space

Dim = 512, after 191 iterations:

Third architecture after 191 iterations with 512 dimensional latent space


Conclusions, as always, are going to be of two kinds: whether it produced good results and whether it was worth the effort.

The quality of the results is far from perfect. Even though the third architecture actually produced something we could have mistaken for spectrograms, the music retrieved does not resemble anything near piano. Here is a sample of what has been reconstructed from these spectrograms:

some sample generated spectrogram

And of course, retrieving audio from spectrogram images is kind of silly - what we should do is feed spectrograms into the network directly (the full matrix, without quantisation of power to 8-bit grayscale). But the problem remains: the DCGAN overall seems to lack precision - as with the original outputs, the generated images "look good from afar", while on closeup we clearly see that there are some artifacts we cannot ignore, which mess up the whole signal. But who knows - maybe feeding the net with actual STFT matrices, a further increase of the latent dimension, greater resolution and other hyperparameter tweaks would actually result in some good outputs? Well, here comes the second reflection about GANs overall.

It's pretty simple: GANs are extremely hard (and annoying) to train. There are a few reasons for this:

1. Evaluation. You won't notice any really meaningful progress reported by the losses of the Generator and the Discriminator. You can tell when it is really bad - but you can never be sure that a small loss on the Generator actually means a better Generator - maybe it was the Discriminator that just wandered off for a while in parameter space? The only way to actually check the results is to print out the pictures and look at them.

2. High instability. GANs being unstable is actually common knowledge, with recent papers from OpenAI about methods to tackle this problem. Maybe, at some point in the future, GANs are going to have some reliable way of training and a set of parameters known to work. But right now, training them with a lot of parameters is just painful - using the first architecture (the one that resembled the original one the most!), a lot of time had to be spent on tweaking hyperparameters to keep the Generator's loss from going up to infinity and producing just random noise. Some notes on what we did:

  • Reduce the learning rate of both G and D to at most 0.00001
  • Increase the beta1 parameter of the Adam optimizer to at least 0.7
  • Increase the batch size (note that increasing the batch size led to higher stability, but in the third architecture it decreased performance!)

We also came up with our own heuristic on how to keep the training process somewhat more stable. The trick is easy: every time the Discriminator's loss is at least 10 times lower (in terms of log-likelihood of the data) than the Generator's loss, kick the Generator's learning rate up by a factor of 2. Do it at most 5 times (resulting in the learning rate being increased by a factor of at most 32), to avoid increasing it indefinitely.

if errD and errG and errG > 10 * errD and mm < 5 then
    print("Kicking G learning rate.")
    optimStateG.learningRate = optimStateG.learningRate * 2
    mm = mm + 1
else
    optimStateG.learningRate = opt.lr
    mm = 0
end

The intuition behind it is that whenever the Discriminator starts to be resistant to the Generator's efforts to fool it, the Generator may just be stuck in some local minimum. The Discriminator then keeps learning well and outperforms the Generator further and further, so the Generator cannot really learn anything and often falls back to random noise. If we kick the Generator's learning rate, we may get it out of the local minimum - and our experience kind of proves this theory. However, even the most stable models can suddenly fall back to random noise after as many as 400 iterations, which surely can discourage one from training these networks.
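For clarity, the heuristic restated in Python (a sketch with made-up names: err_d/err_g are the current losses, base_lr plays the role of opt.lr, and kicks of mm):

```python
def kick_lr(err_d, err_g, lr, base_lr, kicks, max_kicks=5, ratio=10.0):
    """Double G's learning rate while D's loss is far below G's; otherwise reset.

    Returns the new (learning rate, kick count).
    """
    if err_d is not None and err_g is not None \
            and err_g > ratio * err_d and kicks < max_kicks:
        return lr * 2, kicks + 1
    return base_lr, 0

# D strongly outperforms G: kick the learning rate.
lr, kicks = kick_lr(0.05, 1.0, lr=1e-5, base_lr=1e-5, kicks=0)
print(lr, kicks)  # 2e-05 1

# Losses back in balance: reset to the base learning rate.
lr, kicks = kick_lr(0.5, 1.0, lr=lr, base_lr=1e-5, kicks=kicks)
print(lr, kicks)  # 1e-05 0
```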

What next?

The GAN adventure has been interesting - this is arguably one of the most extraordinary kinds of neural networks nowadays: always consisting of two models, highly unstable, yet maintainable if used with care and a good sense of what is actually going on there. There are a lot of things we didn't check with GANs, such as more hyperparameter tweaks, different kinds of data fed into them and using some recurrent layers to tackle the problem. All of which, combined with our GPUs being too old for cuDNN and problems with Lua, can be really time-consuming - which we, unfortunately, cannot afford.

We still have hope for the GANs, as we clearly see they are "not that far" from producing quite realistic results, which then may be smoothed by some simple algorithmic heuristics. As we described earlier, this part of the project is meant to find all the cool models that actually work and dive into them when we feel like it's a safe option.

And maybe we have just found one.

No, we are not talking about GANs. If you want to learn more, come back in a while to hear some actual music!
