Progressive Growing of Sound

Posted by Piotr Kozakowski & Bartosz Michalak on Tue 30 January 2018

We’re still working!

Hello again, everyone! Just a short note at the beginning - we’re not done with our Master’s Thesis yet! (You can decide whether that’s good news or not…) Working hard on our internships through the summer, we didn’t have much time to think about the DeepSound project, but after the summer we got back with plenty of new ideas, some of which failed dramatically, some of which are promising, and some of which need to be shown right away! But first, a quick recap:

Past experiments with GANs

As you might have noticed from our previous posts, the authors of this blog, though cooperating closely, split their attention between two main areas: Piotr focused on SampleRNN, which led to its reimplementation in PyTorch, while Bartek’s thoughts circled around the potential use of GANs for generating audio. While SampleRNN was very promising from the beginning (and with our new ideas it may get even better - more about that later), GANs failed miserably at outputting spectrograms useful for sound reconstruction. There were a few reasons for this. Apart from the big problems with training, the GAN simply failed to capture details, and the small artifacts it left on the images made even semi-convincing-looking spectrograms useless for sound reconstruction. And even if we had somehow managed to get rid of those, the maximum resolution we could achieve was 128x128px, which is still too low a frequency resolution for the STFT to be usable in sound reconstruction (you would hear a very heavily modulated sound, resembling the original instrument - e.g. a piano - only to a very small extent).

Recall that this was just a year ago - a lot of GAN training techniques and architectures were yet to be invented. Though we tried a few other things, the GAN-based idea for generating sound was left behind.

Up until just recently…

Progressive growing of GANs


Among about a bazillion GAN papers published throughout 2017, there is one big winner, which presents results far better than anything else to this day in terms of the level of detail - NVIDIA’s famous Progressive Growing of GANs paper, which shows that a simple yet brilliant idea is sometimes superior to many complicated and sophisticated methods of training and architecture design. And the results truly are astounding:

An enormous resolution of 1024x1024 with a great level of detail and a stable, reliable training procedure: that is the summary of the paper and of the original authors’ implementation. The idea of progressive growing is very simple. Let’s say we want to generate 1024x1024px images, that is 2^10 x 2^10px. We first downsample our dataset to every power-of-2 resolution, from 2^2 x 2^2px (4x4) through 8x8px, 16x16px etc., up to the original 2^10 x 2^10 resolution. Then we train the GAN on these pictures with the procedure described well in the aforementioned paper (ch. 2, figs. 1 and 2).
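The downsampling step can be sketched in a few lines of plain Python. This is a minimal illustration, not the original data pipeline: the function names are made up for this post, and 2x2 average pooling as the downsampling filter is an assumption.

```python
def downsample_2x(img):
    """Halve the resolution of a square 2D image (list of rows) by 2x2 averaging."""
    size = len(img) // 2
    return [
        [
            (img[2 * r][2 * c] + img[2 * r][2 * c + 1]
             + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(size)
        ]
        for r in range(size)
    ]


def resolution_pyramid(img, min_res=4):
    """Return copies of img at every power-of-2 resolution, smallest first."""
    levels = [img]
    while len(levels[-1]) > min_res:
        levels.append(downsample_2x(levels[-1]))
    return levels[::-1]


# A toy 16x16 "image" with a horizontal gradient.
image = [[float(c) for c in range(16)] for _ in range(16)]
pyramid = resolution_pyramid(image)
print([len(level) for level in pyramid])  # [4, 8, 16]
```

For a 1024x1024 dataset the same loop would produce nine levels, from 4x4 up to 1024x1024, one per training stage.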

We will describe just the procedure for the generator here; the one for the discriminator is trivially symmetric. It’s easily described by induction:

Base (i = 1):

  • Start with just one layer, generating pictures 2^INIT_RES x 2^INIT_RES (e.g. 4x4).
  • Train it for N_train_img images.

Inductive step (i > 1):

  • Stack a new layer on top of the previous one (resulting in an i-layer network), generating pictures with twice the resolution (e.g. in the first such step it would be 4x4 -> 8x8).
  • Train the network for N_transition_img images, treating the upsampled output of the previous layer and the output of the current layer like a residual block: the two are blended with a parameter alpha, the weight of the current layer versus the previous layer’s output, which increases linearly from 0 to 1 and reaches 1 once N_transition_img images have been shown.
  • Train the network normally for N_train_img images.
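The blending during the transition phase can be sketched as follows. This is a toy version on nested Python lists rather than real network outputs; the function names are hypothetical, and the nearest-neighbour upsampling is our reading of the paper’s figure, not something taken from any particular implementation.

```python
def upsample_nearest_2x(img):
    """Double the resolution of a 2D image by repeating every pixel 2x2 times."""
    out = []
    for row in img:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fade_in(prev_output, new_output, alpha):
    """Blend the upsampled previous-layer output with the new layer's output."""
    upsampled = upsample_nearest_2x(prev_output)
    return [
        [(1.0 - alpha) * old + alpha * new for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(upsampled, new_output)
    ]


prev_output = [[1.0, 2.0], [3.0, 4.0]]      # stand-in for the previous layer's 2x2 output
new_output = [[0.0] * 4 for _ in range(4)]  # stand-in for the new layer's 4x4 output
blended = fade_in(prev_output, new_output, alpha=0.25)
print(blended[0])  # [0.75, 0.75, 1.5, 1.5]
```

With alpha = 0 the result is exactly the upsampled old output, and with alpha = 1 only the new layer contributes, so the new layer is eased in instead of disturbing the already trained ones.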

BTW, N_transition_img and N_train_img correspond to the lod_transition_kimg and lod_training_kimg parameters in the original implementation (which are given in thousands of images), and they are pretty important parameters - more on that later.
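To make the schedule concrete, here is a minimal sketch that maps the number of images shown so far to the current network depth and alpha, following the induction above. The function name and its signature are hypothetical, and the numbers in the example are toy values, not the ones used in any real training run.

```python
def depth_and_alpha(images_seen, n_train_img, n_transition_img, max_depth):
    """Return (depth, alpha) for a given number of images shown so far."""
    # Base case: the first layer is simply trained for n_train_img images.
    if images_seen < n_train_img:
        return 1, 1.0
    # Every later layer gets a transition phase followed by a training phase.
    past_base = images_seen - n_train_img
    phase_len = n_transition_img + n_train_img
    depth = 2 + past_base // phase_len
    if depth > max_depth:
        return max_depth, 1.0
    within_phase = past_base % phase_len
    if within_phase >= n_transition_img:
        return depth, 1.0  # transition finished, normal training
    return depth, within_phase / n_transition_img  # linear fade-in


for seen in (0, 100, 125, 150, 250, 10000):
    print(seen, depth_and_alpha(seen, n_train_img=100, n_transition_img=50, max_depth=3))
# 0 (1, 1.0)
# 100 (2, 0.0)
# 125 (2, 0.5)
# 150 (2, 1.0)
# 250 (3, 0.0)
# 10000 (3, 1.0)
```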

And that’s the main trick. Really. There are of course other things, which lead to stable training and great results, namely:

  • A sensible loss function: WGAN-GP, about which you can read here, which stands for “Wasserstein GAN - gradient penalty” and is also called “Improved Wasserstein” (hence the “iwass…” parameter names in NVIDIA’s implementation). The original WGAN requires the discriminator (or, to be precise, the critic) to be a 1-Lipschitz function and tries to accomplish that with weight clipping. The WGAN-GP authors show why this is unhealthy for the network and suggest instead penalizing the norm of the discriminator’s gradient (taken w.r.t. data points that are random mixes of generated and true samples) for diverging from 1 (or, for a k-Lipschitz function, from k, which is iwass_target in NVIDIA’s implementation).
  • Weight scaling: the authors (chapter 4.1 of the paper) initialize the network with N(0,1) and then scale the weights dynamically at runtime, dividing by the per-layer normalization constant from the well-known paper about weight initialization by He et al. This is supposed to equalize the learning rate for every parameter, and hence improve training. For more details, we encourage you to read the paper.
  • Pixelwise feature normalization in the generator: this is just normalization across the channel axis, so that for every pixel the feature vector has a mean squared value of 1 (no mean is subtracted, so this is close to, but not exactly, unit variance).
  • LeakyReLU activations
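The tricks above are small enough to sketch in plain Python for a single pixel and a single layer. Treat this as an illustration under stated assumptions, not as the code of any implementation: all function names are hypothetical, the leak slope of 0.2 and the penalty weight of 10 are illustrative defaults, dividing the penalty by the target is one possible way of handling a k-Lipschitz target, and the weight scaling is written as a multiplication by sqrt(2 / fan_in), i.e. the He et al. standard deviation.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU: pass positive values through, scale negative ones by slope."""
    return x if x >= 0.0 else slope * x


def pixel_norm(features, eps=1e-8):
    """Scale one pixel's feature vector (one value per channel) to unit mean square."""
    mean_square = sum(f * f for f in features) / len(features)
    return [f / math.sqrt(mean_square + eps) for f in features]


def he_constant(fan_in):
    """Per-layer constant from He et al. initialization (assumed form)."""
    return math.sqrt(2.0 / fan_in)


def gradient_penalty(grad_norm, target=1.0, weight=10.0):
    """Penalty for the critic's gradient norm diverging from the target."""
    return weight * ((grad_norm - target) / target) ** 2


# Pixelwise normalization: the mean square over channels becomes 1.
normed = pixel_norm([3.0, 4.0])
print(round(sum(f * f for f in normed) / len(normed), 6))  # 1.0

# Weight scaling: an N(0,1)-initialized weight is rescaled at runtime.
stored_weight = 0.5
runtime_weight = stored_weight * he_constant(fan_in=8)
print(runtime_weight)  # 0.25

# Gradient penalty for a linear critic D(x) = w . x, whose gradient w.r.t. x
# is w everywhere, so its norm does not depend on the mixed sample.
w = [3.0, 4.0]
grad_norm = math.sqrt(sum(v * v for v in w))
print(gradient_penalty(grad_norm))              # 160.0
print(gradient_penalty(grad_norm, target=5.0))  # 0.0
```

In a real network the gradient norm comes from automatic differentiation at the mixed samples; the linear critic is used here only because its gradient can be written down by hand.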

Using all of the tricks above, with the original idea of progressive growing, we achieve great resolution with a tremendous accuracy of details…

Progressive Growing of Spectrograms

...which is exactly what we needed during our first attempt at generating spectrograms! So, as we’ve said multiple times, let the results once again speak for themselves:

Ladies and gentlemen, you have witnessed the first - to the best of our knowledge - successful result of generating an audio signal using Generative Adversarial Networks. Thank you for your attention.

The experiment details

Dataset and reconstruction:

  • Piano dataset, linked in our previous post, cut into 5-second parts sampled at 16000 Hz
  • Spectrograms: 512x512, obtained with an STFT with n_fft = 1024 and hop_length = 128
  • The sound was reconstructed with 1000 iterations of the Griffin-Lim phase reconstruction algorithm
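As a quick sanity check of these numbers, the sketch below computes the size of a raw STFT for such a clip. The helper is hypothetical and only uses the usual frame-count conventions for a centered and an uncentered STFT; how the raw output is trimmed to exactly 512x512 is not covered here, so the last line is plain arithmetic, not a description of the preprocessing.

```python
def stft_shape(num_samples, n_fft, hop_length, center=True):
    """Return (frequency bins, time frames) of an STFT of a real signal."""
    freq_bins = n_fft // 2 + 1
    if center:
        # The signal is padded so that frames are centered on multiples of hop_length.
        frames = 1 + num_samples // hop_length
    else:
        frames = 1 + (num_samples - n_fft) // hop_length
    return freq_bins, frames


sample_rate = 16000
num_samples = 5 * sample_rate
print(stft_shape(num_samples, n_fft=1024, hop_length=128))  # (513, 626)

# Duration in seconds covered by 512 hops of 128 samples each.
print(512 * 128 / sample_rate)  # 4.096
```

So the raw STFT of a 5-second clip is slightly larger than 512x512 in both dimensions, and a 512-frame window spans roughly 4.1 seconds of audio.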

Basically, this is all you need to feed into NVIDIA’s original code to get results similar to ours. The rest of the parameters for their code are left at their defaults, except for:

  • lod_training_kimg and lod_transition_kimg were both set to 150
  • reasonable quality is reached at about 1400 kimg in total; we trained for at most about 2000 kimg

The GPU we used is a Titan X with 10 GB of memory. During the last part of the training, the process takes about 9.5 GB of GPU memory.

But what would be our contribution here, if we just used NVIDIA’s code?

Our code

We present a reimplementation of NVIDIA’s paper in PyTorch: take a look at our code here. We tried to keep it similar, in terms of the patterns used, to our SampleRNN implementation, reusing some of PyTorch’s (actually kinda outdated and deprecated) utilities like Trainer or Dataset. Anyway, our code brings you:

  • all the default parameters set to more-or-less resemble our training procedure
  • Datasets for both image generation and sound generation through spectrograms (and a few other modes we may talk about in future posts!)
  • a utility to generate samples from a generator network saved with PyTorch
  • better modularization and extensibility

We want to highlight that this code is still very much in constant development, accompanied by lots of experiments whose results are not yet published, and it may be buggy! We also want to acknowledge a few other implementations that were a big inspiration and provided some hints or ideas on how to implement certain parts:

Our code provides a very high-level abstraction of what a Progressive Growing network is - it just requires the network to expose depth and alpha attributes. The same goes for the Dataset class. Progressive growing is handled by the DepthManager plugin, which makes sure that the network is deepened at the right points in training and sets alpha appropriately. What’s more, by removing the DepthManager plugin, you get a framework for training most GANs with the WGAN-GP loss. This can even be achieved with this very code, by setting depth and alpha properly for the models provided.
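To illustrate the idea of this abstraction (and only the idea), here is a toy sketch of a plugin that drives objects exposing depth and alpha. It is not the DepthManager from our repository: the class names, the on_batch method and the max_depth attribute are all hypothetical and exist only for this example.

```python
class GrowingStub:
    """Stand-in for a generator, discriminator or dataset with depth and alpha."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.depth = 1
        self.alpha = 1.0


class DepthManagerSketch:
    """Hypothetical plugin: deepens all targets together and fades alpha in."""

    def __init__(self, targets, n_train_img, n_transition_img):
        self.targets = targets
        self.n_train_img = n_train_img
        self.n_transition_img = n_transition_img
        self.images_in_phase = 0

    def on_batch(self, batch_size):
        """Call once per training batch to update depth and alpha everywhere."""
        self.images_in_phase += batch_size
        depth = self.targets[0].depth
        max_depth = self.targets[0].max_depth
        # The very first depth has no transition phase, only normal training.
        transition = self.n_transition_img if depth > 1 else 0
        if self.images_in_phase >= transition + self.n_train_img and depth < max_depth:
            depth += 1
            self.images_in_phase = 0
            transition = self.n_transition_img
        alpha = 1.0
        if depth > 1 and self.images_in_phase < transition:
            alpha = self.images_in_phase / transition
        for target in self.targets:
            target.depth = depth
            target.alpha = alpha


generator, discriminator, dataset = (GrowingStub(max_depth=3) for _ in range(3))
manager = DepthManagerSketch([generator, discriminator, dataset],
                             n_train_img=4, n_transition_img=4)
history = []
for _ in range(10):
    manager.on_batch(batch_size=2)
    history.append((generator.depth, generator.alpha))
print(history[:4])  # [(1, 1.0), (2, 0.0), (2, 0.5), (2, 1.0)]
```

Because the models and the dataset only need to react to depth and alpha, dropping such a plugin and fixing the two attributes by hand leaves an ordinary GAN training loop.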

What’s more, our code handles the original paper’s h5py-based datasets. All you need to do is provide a few parameters to the OldH5Dataset. One thing arising entirely from Bartek’s laziness about writing command line tools line by line is a funky and obviously entirely undocumented way of providing parameters for training: you can pass virtually any parameter of any class that is constructed on the way. As the work is still in progress and a few concepts may be redesigned along the way, possibly with PGGAN and our spectrogram-generating parts separated, we will dive into the details in a future post.


PGGAN - progressively growing GAN - seems to be a general enough method, with great performance for a dataset that is very homogeneous and “easily upscalable” (whatever the latter formally means). Note that this training procedure actually assumes that the local minima the networks fall into at each stage, which “condition” the newer, finer-grained layers on the previous, less detailed ones, are also (close to) global minima. It turns out to be a good and sensible assumption, which the authors call “increasing the difficulty of the task gradually”. We believe it may be the way to go for a lot of future GAN architectures, and we will surely try to research this area broadly.

Next, we believe that PyTorch is a well-suited framework for this task - trying to understand the Theano code, with all the LODSelectionLayer funkiness and loads of input and output layers, is both difficult and annoying. Essentially, even though we highly value the original implementation - it’s a really good piece of software, extensible, modularized and clean, contrary to what many researchers publish with their papers - we argue that dynamic creation of the computation graph will make thinking about and designing future Progressively Growing architectures much easier, and debugging them a whole lot easier. Alas, our implementation is a little bit slower - we are still investigating it, but the overall decrease in performance is about 20%, which means a few more hours of training.

Finally, we believe that our work here, standing on the shoulders of giants (which, obviously, means reusing someone’s code or ideas), opens a large area of new possibilities in sound generation - as we again want to highlight, this is (to our knowledge - please enlighten us if we’re wrong!) the first successful sound signal generation using Generative Adversarial Networks.

Let us know what you think or what you have accomplished using our code!
