Hi! We will guide you through our process of creating a neural network for music genre recognition. Since early 2016, inspired by one of the data science courses at our university, we had been thinking about combining deep learning and music. One of our early ideas was to create a network recognizing the genre of a song. As the idea developed, we collected more and more resources, which eventually led us to the decision to implement it during the Braincode 2016 hackathon in Warsaw. We did so and we won! In addition to the network itself, we made a visualization of its current belief of the music genre as time progresses. The result of our efforts can be seen at our demo - outputs of our network for a few selected songs.
As we want to focus on the work we did ourselves, this blog won't contain tutorials or articles about the basics in the field of deep learning. Examples of such assumed knowledge are: what are convolutional neural networks and what are recurrent neural networks (and why they're awesome).
The first problem we needed to tackle was how we should represent music - notably, it is still a valid problem for our ongoing research. The raw input is surely a .wav file, containing the pure signal - it includes no metadata on instruments, author, etc. The pure signal isn't very handy though - mainly because it's quite heavy. Feeding it to a pure LSTM was an option, but since it's computationally expensive, we abandoned that idea. We contacted a few friends from The Fryderyk Chopin University of Music - Bartłomiej Majewicz and Monika Orzeł - who are both educated in music theory and proficient in signal analysis, and asked them what features might be relevant for music genre recognition.
Inventing features "by hand" for data is a good exercise, even when its structure is so complicated.
- How would you tackle the problem?
- What would be the things you look for, analysing the song?
- Maybe you should look for some particular frequencies?
- Or maybe consider all the frequencies at once?
- Should you interpret all input at once?
- What is the structure of the input?
Such questions led us to the network architecture: we needed to be able to detect frequencies (tones) over a short period of time (for example to notice the chords used in a song), but also needed to look at the song as a whole. If we could only plot a frequency distribution over time… As it turns out, we can - and the solution came quite unexpectedly.
Apart from the help from our musician friends, we also told Bartek's close friend from the Physics Faculty (thank you, Gabrysia Dzięgiel!) about the project we were working on. Telling everyone what you do is generally a very good idea. Sometimes your friends surprise you with a message like "I may have found a recent paper tackling your problem", directing you to Grzegorz Gwardys and Daniel Grzywczak, "Deep Image Features in Music Information Retrieval": an article describing the analysis of spectrograms - visual representations of frequency distribution over time - using a Convolutional Neural Network. To solve the genre recognition problem. Perfect.
Dataset and training
Just a quick note about the dataset, which we will refer to from time to time later on. Since we wanted to have comparable results, we based the choice of dataset on the aforementioned paper. The authors use the GTZAN dataset, which is well known in the MIR community. Even though the dataset is considered 'standard' for the music genre classification problem, it has a few flaws, which quite significantly impact the interpretation of the results of any model trained on GTZAN. In fact, there is even a paper describing in detail all the cons of the GTZAN dataset. Some of the problems described there are misclassifications, distortions and duplicates in the data.
Another problem with GTZAN is related not to its construction itself, but to the fact that it was made in 2000. Though some may argue, it is a quite commonly accepted truth that during the last 16 years there actually was some contribution to the musical world overall, particularly to electronic music. GTZAN doesn’t reflect this contribution - in fact it doesn’t recognize electronic music as a genre at all, which may seem pretty odd nowadays.
What are those spectrograms?
"A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time or some other variable." (Wikipedia)
Sounds good! That would be exactly what we were hoping for. Let's see what it actually looks like:
This spectrogram matches the following piece of classical music from GTZAN (30-second piece no. 79 from the GTZAN dataset's classical folder):
Do you see the yellow dots? These are the peaks of power over time (horizontal axis) and frequency (vertical axis). They mean that those tones (indicated by the frequency level) are played over that short period of time. Such small dots indicate single keypresses on a piano. On the other hand, chords require several frequencies at once (they are made of multiple tones) and play longer over time (creating longer lines along the horizontal axis). Listen to the end of the song: can you notice the final chord progression on the spectrogram?
Okay. That's nice: we have an overall look at the song, but it's just one example. Let's look at some blues music and see whether the spectrograms really hold any distinctive features:
The answer becomes pretty obvious. One of the features easily recognizable in the blues spectrogram is the drums - the vertical bars on the spectrogram indicate a short but powerful peak across all the frequencies. It is caused by the beat - a feature present in much of popular music, but largely absent in classical.
The authors of the mentioned paper used a pre-trained network - a winner of the ImageNet competition - and fine-tuned it on the data. The idea itself sounds pretty odd - since data from ImageNet is nowhere near such abstract pieces as spectrograms.
Better spectrograms - better results
While researching the topic, we stumbled across a blog post by Sander Dieleman: "Recommending music on Spotify with Deep Learning". Another great example of using Convolutional Neural Networks to analyse spectrograms, which reassured us that the way we're going is the right way. One note though: the spectrograms Sander was using were not raw spectrograms, but mel-spectrograms. Let's take a look at the mel-spectrograms of the previous examples:
A mel-spectrogram is a spectrogram transformed to have frequencies on the mel scale, which is roughly a logarithmic scale that more naturally represents how humans actually perceive different sound frequencies. That was something we didn't know about and probably wouldn't ever have found out if it weren't for recommending music on Spotify! Though the specific numbers are missing, we noticed a significant accuracy jump when we transformed the raw spectrograms into mel-spectrograms.
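To make the mel scale a bit more concrete, here is a minimal sketch of the conversion between Hz and mels. It assumes one commonly used formula, mel = 2595 * log10(1 + f / 700); other variants of the scale exist, and this is an illustration rather than the code used in the project.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers fewer mels at high frequencies than at low ones,
# which is how the scale compresses the upper part of the spectrum.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(8100.0) - hz_to_mel(8000.0)
print(low_step, high_step)
```

A mel-spectrogram is then obtained by grouping the spectrogram's frequency bins into bands that are evenly spaced on this scale.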
One last thing: note that a spectrogram is the squared magnitude of the Short Time Fourier Transform (STFT) of the signal. The most important parameter of the STFT is the window length: how long the window of time on which we perform the Fourier Transform should be. Many thanks to the mentioned friends from the Fryderyk Chopin University of Music, who helped us learn what the shortest reasonable period is that a human ear can distinguish when listening to music. Based on their advice, and confirmed by the results, we decided on a window 2048 samples long, which corresponds to about 93 ms for songs in the GTZAN dataset (sampled at 22050 Hz).
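The relation between window length, sample rate and the resulting spectrogram can be sketched in a few lines. This is a toy version under stated assumptions: a Hann window, a naive DFT written out for readability (a real pipeline would use an FFT from a signal processing library), and a tiny synthetic signal instead of a GTZAN track. It is not our preprocessing code.

```python
import cmath
import math


def window_duration_ms(window_length, sample_rate):
    """Duration of one analysis window in milliseconds."""
    return 1000.0 * window_length / sample_rate


def power_spectrogram(signal, window_length, hop_length):
    """Squared magnitude of a naive STFT: one list of frequency bins per frame."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / window_length)
              for n in range(window_length)]  # Hann window
    n_bins = window_length // 2 + 1
    frames = []
    for start in range(0, len(signal) - window_length + 1, hop_length):
        chunk = [signal[start + n] * window[n] for n in range(window_length)]
        frame = []
        for k in range(n_bins):
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / window_length)
                        for n in range(window_length))
            frame.append(abs(coeff) ** 2)
        frames.append(frame)
    return frames


# Toy signal: an 800 Hz sine sampled at 6400 Hz, analysed with a 64-sample window,
# so the tone falls exactly into bin 800 * 64 / 6400 = 8.
sample_rate = 6400
window_length = 64
signal = [math.sin(2.0 * math.pi * 800.0 * n / sample_rate) for n in range(256)]
spec = power_spectrogram(signal, window_length, hop_length=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
print(window_duration_ms(2048, 22050))
```

A longer window gives finer frequency resolution but blurs events in time, which is why the choice of window length matters so much.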
Now we know how we should represent the data - it's time to move on to talk about our model.
Convolutional Recurrent Neural Network
The one obvious decision, which is present across both mentioned articles and our work is using convolutional layers to extract features from a song. But let's be honest - we've got a pretty long sequence in which every timestep strongly relies on both the immediate predecessors and long term structure of a whole song. And, after all, it's 2016. LSTM was an obvious choice.
First, we extract features from the spectrograms using convolutional layers. We decided on one-dimensional convolutions - just across the time axis. 2-D convolution seems unsuitable in this case: we are interested in changes across time - every convolutional layer should look at a small period of time as a whole, extract the most valuable information and create a feature map that is still a sequence over time. The features are translation-invariant only in the time domain - we still need to distinguish between higher and lower frequencies. After each layer we use ReLU activation and 1-D max pooling, which are pretty safe and reasonable choices.
The resulting sequence of features is then fed to an LSTM layer, which should "find" both the dependencies across short periods of time and the long-term structure of a song. To regularize the model, we used dropout layers before and after the LSTM.
After the LSTM and the following dropout, we feed the resulting sequence into a time-distributed fully connected layer with softmax activation, essentially giving us one 10-dimensional vector (10 is the number of genres in GTZAN) for each timestep. We would later like those vectors to represent our network’s belief of the music genre at the particular point of time, modelled as probability distributions.
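Here is a minimal sketch of what "convolution across the time axis only" means. Each filter spans every frequency bin of a few consecutive timesteps, so sliding it along time yields a new sequence with one value per filter. The toy sizes and the hand-picked weights are assumptions for illustration; they are not the layer sizes or learned filters of our network.

```python
def conv1d_over_time(frames, filters, biases):
    """1-D convolution along time followed by ReLU.

    frames:  T timesteps, each a list of F frequency values
    filters: one entry per filter, each holding K rows of F weights
    Returns T - K + 1 timesteps with one feature per filter.
    """
    kernel_size = len(filters[0])
    output = []
    for t in range(len(frames) - kernel_size + 1):
        features = []
        for weights, bias in zip(filters, biases):
            total = bias
            for dt in range(kernel_size):
                for f, w in enumerate(weights[dt]):
                    total += w * frames[t + dt][f]
            features.append(max(0.0, total))  # ReLU
        output.append(features)
    return output


def max_pool_over_time(frames, pool_size):
    """Non-overlapping 1-D max pooling along the time axis."""
    pooled = []
    for start in range(0, len(frames) - pool_size + 1, pool_size):
        block = frames[start:start + pool_size]
        pooled.append([max(frame[c] for frame in block)
                       for c in range(len(block[0]))])
    return pooled


# Toy input: 6 timesteps, 2 frequency bins, energy alternating between the bins.
frames = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
# One hand-made filter over 2 timesteps: fires when energy moves from bin 1 to bin 0.
filters = [[[0.0, 1.0], [1.0, 0.0]]]
biases = [-1.0]

features = conv1d_over_time(frames, filters, biases)
pooled = max_pool_over_time(features, pool_size=2)
print(features)
print(pooled)
```

Note that the filter cannot slide along the frequency axis: swapping the two bins in the input would change its response, which is the behaviour we want when high and low frequencies mean different things.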
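The time-distributed layer can be pictured as the very same dense layer and softmax applied independently at every timestep. The sketch below assumes toy sizes (2 features per timestep, 3 genres instead of 10) and made-up weights, purely to show the shape of the computation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_distributed_softmax(sequence, weights, biases):
    """Apply one shared dense layer with softmax to every timestep.

    sequence: T vectors of H features (for example, the LSTM outputs)
    weights:  G rows of H weights, one row per genre
    biases:   G values
    """
    outputs = []
    for features in sequence:
        scores = [sum(w * x for w, x in zip(row, features)) + b
                  for row, b in zip(weights, biases)]
        outputs.append(softmax(scores))
    return outputs


# Toy example: 3 timesteps, 2 features, 3 genres; the weights are invented.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]

beliefs = time_distributed_softmax(sequence, weights, biases)
for step, belief in enumerate(beliefs):
    print(step, [round(p, 3) for p in belief])
```

Because the weights are shared across timesteps, the layer adds very few parameters and works for sequences of any length.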
Okay. So right now, we have ended up with a time series of 10-dimensional vectors, each being some probability distribution. And we want each of them to be a genre probability distribution. Obviously, the GTZAN dataset, on which we were training the network, contained just one label per song. Our choice on how to tackle this mismatch - inferring music genre per timestep versus just one label for the whole song in the data - was as simple as it could be.
What is an already implemented loss function, suitable for multi-class classification? Categorical Cross-Entropy. Okay, but it takes just two probability distributions: the real one and the predicted one. If we are to use the standard loss function, which at the very least is just a safe choice, we need to aggregate the outputs into just one probability distribution.
We decided on taking the mean across the time dimension. Intuition behind this idea is as follows: it is rather expected for a rock song, to play in a rock genre most of the time. If most of the song classified as rock sounds in fact more like jazz, the classification is probably incorrect. One obvious solution would be to just take an arithmetic mean across time of all the predicted distributions and return it as a final answer. And if you haven’t really come up with anything sensible yet, you really like solutions that seem obvious from the very beginning. So we did this. Note that arithmetic mean of vectors, each representing a probability distribution, returns a vector that is also a valid distribution.
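The aggregation described above fits in a few lines. The sketch uses invented per-timestep outputs for 3 genres (GTZAN has 10) and a plain implementation of categorical cross-entropy; it shows the idea, not our training code.

```python
import math


def mean_over_time(distributions):
    """Arithmetic mean of per-timestep distributions, genre by genre."""
    steps = len(distributions)
    n_genres = len(distributions[0])
    return [sum(d[g] for d in distributions) / steps for g in range(n_genres)]


def categorical_cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Cross-entropy between the true and the predicted distribution."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_dist, predicted_dist))


# Invented network outputs for three timesteps of one song.
per_timestep = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
song_label = [1.0, 0.0, 0.0]  # one-hot label for the whole song

song_belief = mean_over_time(per_timestep)
loss = categorical_cross_entropy(song_label, song_belief)
print(song_belief, loss)
```

Each genre's mean is a convex combination of values that sum to one across genres, so the averaged vector sums to one as well and can be passed to the loss directly.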
There are other possibilities though. Since the LSTM produces each output based on the state it has accumulated over the previous timesteps, we expect the results to be more and more precise over time. A weighted average based on song length might produce better results, but would be a little harder to implement. The exact form of this weighted average would also be subject to further improvements - the function of time selected to produce weights for predictions would be another parameter to tweak. We didn't have time for such experiments and, most importantly, we achieved good results without them.
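For the curious, this is roughly what such a weighted average could look like. We never implemented or evaluated it, so everything here is hypothetical, including the linearly growing weights produced by `linear_weights`; any other increasing function of time could be plugged in instead.

```python
def weighted_mean_over_time(distributions, weights):
    """Weighted average of per-timestep distributions; weights need not sum to 1."""
    total = float(sum(weights))
    n_genres = len(distributions[0])
    return [sum(w * d[g] for w, d in zip(weights, distributions)) / total
            for g in range(n_genres)]


def linear_weights(steps):
    """Hypothetical weighting: later timesteps count proportionally more."""
    return [t + 1 for t in range(steps)]


# Invented outputs: the network grows more confident in genre 0 over time.
per_timestep = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
belief = weighted_mean_over_time(per_timestep, linear_weights(len(per_timestep)))
print(belief)
```

Dividing by the sum of the weights keeps the result a valid probability distribution, so the same loss function would still apply.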
The GTZAN dataset was split in a 700:300 ratio, for the training and test set respectively. We achieved two main goals with the described network:
Create a model for music genre recognition which works correctly most of the time. Our model achieves 67% accuracy on the test set when comparing the mean output distribution with the correct genre. For comparison, a random model would guess correctly only 10% of the time.
Our result may not appear impressive - according to this presentation, state of the art in recognizing music genre on GTZAN using deep learning approaches was 84% in 2013. However, our model solves a slightly different problem - we don’t just want a single prediction for each track, but a continuous output containing the network’s belief of the genre at every point in time.
We actually sacrificed some of the accuracy to achieve that - before we came up with the continuous output solution, we had a network with a single output, determining the overall genre, and this network achieved 70+% accuracy. The specific results and hyperparameters of that network are lost, but if you want to try to reproduce it using our code, feel free to do so.
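For clarity, "comparing the mean output distribution with the correct genre" can be read as the following evaluation rule. The outputs and labels below are invented toy data with 2 genres, so the printed score says nothing about our model.

```python
def predict_genre(per_timestep):
    """Index of the most probable genre in the mean distribution over time."""
    steps = len(per_timestep)
    n_genres = len(per_timestep[0])
    mean = [sum(d[g] for d in per_timestep) / steps for g in range(n_genres)]
    return max(range(n_genres), key=lambda g: mean[g])


def accuracy(all_outputs, labels):
    """Fraction of songs whose predicted genre matches the label."""
    correct = sum(1 for outputs, label in zip(all_outputs, labels)
                  if predict_genre(outputs) == label)
    return correct / len(labels)


# Three invented songs, two timesteps each.
outputs = [
    [[0.9, 0.1], [0.6, 0.4]],  # mean favours genre 0
    [[0.3, 0.7], [0.4, 0.6]],  # mean favours genre 1
    [[0.8, 0.2], [0.1, 0.9]],  # mean is [0.45, 0.55], so genre 1
]
labels = [0, 1, 0]
score = accuracy(outputs, labels)
print(score)
```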
Having more fun
The fun with a neural network does not end once you have trained it to solve some particular problem. That is the moment the real fun actually begins. Neural networks create abstractions of the data that we often find unexpected or amusing. We just wouldn't be ourselves if we didn't want to check out what the neurons actually learned!
Inspired by the DeepVis Toolbox, which, among other things, can visualize convolutional filters learned by different neurons of a CNN, we also made a kind of “visualization” of the neurons in the convolutional layers of our network. We did so for every neuron by extracting the chunks of music pieces from GTZAN that result in the highest activations of that neuron. We then concatenated a few of these chunks, resulting in the sounds listed below. Try listening to every one of them - can you hear the common theme? The names we gave to those filter sounds are very subjective - consider them hints - we are not experts in music, so we often cannot describe accurately what a sound really is. If you come up with better names for those filters, please let us know in the comments!
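The selection step can be sketched as a simple ranking. The sketch assumes the activations of one convolutional layer have already been computed for every track; the track names and numbers are placeholders, and mapping a timestep back to an audio chunk (which depends on the layer's receptive field) is left out.

```python
def top_activating_chunks(activations, filter_index, top_k):
    """Find where a given filter fires most strongly.

    activations: dict mapping a track name to its timesteps,
                 each timestep holding one activation per filter
    Returns the top_k (activation, track, timestep) triples, strongest first.
    """
    scored = []
    for track, frames in activations.items():
        for t, frame in enumerate(frames):
            scored.append((frame[filter_index], track, t))
    scored.sort(reverse=True)
    return scored[:top_k]


# Placeholder activations: 2 tracks, 2 timesteps, 2 filters.
activations = {
    "blues.00000": [[0.1, 0.9], [0.4, 0.2]],
    "classical.00079": [[0.8, 0.0], [0.3, 0.7]],
}
top = top_activating_chunks(activations, filter_index=0, top_k=2)
print(top)
```

Cutting the corresponding stretches of audio out of the tracks and concatenating them gives one "filter sound".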
Convolutional layer 1
Convolutional layer 2
Convolutional layer 3
As with CNNs for computer vision, our convolutional filters seem to learn more abstract features in higher layers. In the first convolutional layer, all chunks for a filter sound similar, like variations on the same sound. In the second one, the filters are somewhat more sophisticated - for example, in the “Vowels” filter we can hear vowels. These are distinct sounds, there are many different vowels and they are sung by both men and women, but they are all definitely vowels. Finally, the third layer contains the most complex filters, like “Pop”, which seems to detect particular female vocals characteristic of pop music.
See you later!
If you have any questions, please leave a comment!
Now we will leave behind genre recognition for good and switch to the point of this research: music generation. We will try to update on the progress as often as possible!