Deep learning, part IV: Deep dreams of music, based on dilated causal convolutions

Like many neuroscientists, I’m interested in artificial neural networks and curious about deep learning networks. I want to dedicate some blog posts to this topic, in order to 1) approach deep learning from the stupid neuroscientist’s perspective and 2) get a feeling for what deep networks can and cannot do. Part I, Part II, Part III, Part IVb.

One of the most fascinating outcomes of deep networks has been their ability to create ‘sensory’ input based on internal representations of learnt concepts. (I’ve written about this topic before.) I was wondering why nobody had tried to transfer the deep dreams concept from image creation to audio hallucinations. Sure, there are some efforts (e.g. this Python project; the Google project Magenta, based on TensorFlow and also on GitHub; or these LSTM blues networks from 2002). But to my knowledge no one had really tried to apply convolutional deep networks to raw music data.

Therefore I downsampled my classical piano library (44.1 kHz) by a factor of 7 in time (still good enough to preserve the musical structure) and cut it into some 10’000 fragments of 10 sec, which yields musical pieces with 63’000 data points each. This is slightly fewer data points than are contained in 256^2 px images, which are commonly used as training material for deep convolutional networks, so I thought this could work as well. However, I did not manage to make my deep convolutional network classify any of my data (e.g., to decide whether a sample was Schubert or Bach), nor did the network manage to dream creatively of music. As is most often the case with deep learning, I did not know why my network failed.
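The numbers in this paragraph can be checked with a few lines of arithmetic. This is only a sketch; the function and variable names are made up for illustration, and the original rate is assumed to be exactly 44.1 kHz.

```python
# Back-of-the-envelope check of the fragment size after downsampling.
ORIGINAL_RATE_HZ = 44_100   # assumed sampling rate of the mp3 library
DOWNSAMPLE_FACTOR = 7
FRAGMENT_SECONDS = 10


def fragment_length(rate_hz, factor, seconds):
    """Number of samples in one fragment after downsampling."""
    return (rate_hz // factor) * seconds


samples_per_fragment = fragment_length(
    ORIGINAL_RATE_HZ, DOWNSAMPLE_FACTOR, FRAGMENT_SECONDS
)
pixels_per_image = 256 ** 2

print(samples_per_fragment)  # 63000
print(pixels_per_image)      # 65536
```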

Now, Google Deepmind has published a paper that focuses on a text-to-speech system based on a deep learning architecture. But it can also be trained on music samples, in order to later make the system ‘dream’ of music. In the Deepmind blog entry you can listen to some 10 sec examples (scroll down to the bottom).

As the key to their project, they used not just ordinary convolutional filters, but so-called dilated convolutions, and were thereby able to span more length scales (that is, time scales) with fewer layers. This really makes sense to me and explains to some extent why I did not get anything with my normal 1D convolutions. (Other reasons why Deepmind’s net performs much better include more computational power, feedforward shortcut connections, a non-linear mapping of the 16-bit audio to 8 bit for training, and possibly other things.)
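To get a feeling for why dilation helps, one can compare how many input samples a single output sample "sees" in a stack of normal versus dilated 1D convolutions. The sketch below assumes stride 1, a filter width of 2, and dilations that double from layer to layer; these choices are illustrative and not taken from my own network.

```python
def receptive_field(kernel_size, dilations):
    """Input samples seen by one output of a stack of stride-1 1D conv layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


n_layers = 10
plain = receptive_field(2, [1] * n_layers)                        # dilation 1 everywhere
dilated = receptive_field(2, [2 ** i for i in range(n_layers)])   # 1, 2, 4, ..., 512

print(plain)    # 11
print(dilated)  # 1024
```

With the same number of layers, the dilated stack covers roughly a hundred times more samples, which is the sense in which it spans more time scales with fewer layers.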

The authors also mention that it is important to generate the text/music sequence point by point using a causal cut-off for the convolutional filter. This is intuitively less clear to me. I would have expected that musical structure at a certain point in time could very well also be determined by future musical sequences. But who knows what happens in these complex networks and what convergence to a solution looks like.
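What "causal" means here can be shown with a toy example: the output at time t is computed only from the input at time t and earlier, so changing a future sample cannot change past outputs. The following is a minimal pure-Python sketch of a causal dilated convolution with zero padding on the left; the function name and the toy signal are made up for illustration.

```python
def causal_dilated_conv(signal, weights, dilation):
    """y[t] = sum_k weights[k] * signal[t - k * dilation]; samples before t=0 count as 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal cut-off: never look ahead, pad the past with zeros
                acc += w * signal[idx]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = causal_dilated_conv(x, [0.5, 0.5], dilation=2)

# Change only the last two ("future") samples; the first four outputs stay the same.
x_changed = x[:4] + [100.0, -100.0]
y_changed = causal_dilated_conv(x_changed, [0.5, 0.5], dilation=2)

print(y)                       # [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
print(y[:4] == y_changed[:4])  # True
```

This property is what allows the sequence to be generated sample by sample: each new sample only needs samples that already exist.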

Another remarkable point is the short memory of the musical hallucinations linked above. After 1-3 seconds, a musical idea fades because of the exponentially decaying memory; a larger structure is therefore missing. This can very likely be solved by using networks with dilated convolutions that span 10-100x longer timescales and by subsampling the input data (they apparently did not do this for their model, probably because they wanted to generate naturalistic speech and not long-term musical structure). With increasing computational power, these problems should be overcome soon. Putting all this together, it seems very likely that in 10 years you will be able to feed the full Bach piano recordings into a deep network, and it will start composing like Bach afterwards, probably better than any human. Or, similar to algorithms for paintings, it will be possible to input a piano piece written by Bach and let a network that has learned different musical styles continuously transform it into jazz.
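As a rough estimate of what "longer timescales" would cost, one can count how many layers are needed until the receptive field covers a full 10-sec fragment. The sketch assumes a filter width of 2 and dilations that keep doubling in every layer, which is a simplification and not necessarily how the published model is built.

```python
def layers_needed(n_samples, kernel_size=2):
    """Smallest number of layers (dilations 1, 2, 4, ...) whose receptive field covers n_samples."""
    layers, field, dilation = 0, 1, 1
    while field < n_samples:
        field += (kernel_size - 1) * dilation
        dilation *= 2
        layers += 1
    return layers


subsampled = layers_needed(63_000)    # 10 sec at 6.3 kHz
full_rate = layers_needed(441_000)    # 10 sec at 44.1 kHz

print(subsampled)  # 16
print(full_rate)   # 19
```

Under these assumptions, subsampling by 7 saves only about three layers, because the receptive field grows exponentially with depth; the larger saving is that every layer has to process seven times fewer samples.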

On a different note, I was not really surprised to see some sort of convolutional network excel at hallucinating musical structure (since convolutional filters are designed to interpret structure), but I am surprised to see that they seem to outperform recurrent networks for the generation of natural language (this comparison is made in Deepmind’s paper). Long short-term memory recurrent networks (LSTM RNNs, described e.g. on Colah’s blog, invented by Hochreiter & Schmidhuber in ’97) solve the problem of fast forgetting that is inherent in regular recurrent neural networks. I find it a bit disappointing that these problems can also be overcome by blown-up dilated convolutional feed-forward networks, instead of by (more or less) intelligent neuron-intrinsic memory in a recurrent network, as in LSTMs. The reason for my disappointment is that recurrent networks seem to be more abundant in biological brains (although this is not 100% certain), and I would like to see research in machine learning and neural networks also focus on those networks. But let’s see what happens next.

### Update – 30/9/2016 ###

Since I was asked about the piano dataset in the comments, here are a few more words on that topic. First, why did I downsample the recordings by a factor of seven? mp3 recordings are typically encoded at 44.1 kHz, which is roughly twice (the Nyquist factor) the upper hearing limit of humans of about 20 kHz. Storing the higher frequencies is costly, however, and they are almost unimportant here. For example, the frequencies that matter most for human speech are well below 4 kHz.
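To see what is lost by the downsampling, one can compute the Nyquist frequency after downsampling and compare it with the fundamental frequencies of the piano keys. This sketch assumes a standard 88-key piano in equal temperament with A4 (key 49) tuned to 440 Hz; the function names are made up for illustration.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency that can be represented at a given sampling rate."""
    return sample_rate_hz / 2


def piano_key_frequency(key_number):
    """Fundamental of key 1..88, equal temperament, A4 (key 49) = 440 Hz."""
    return 440.0 * 2 ** ((key_number - 49) / 12)


limit = nyquist_hz(44_100 / 7)
keys_below = [k for k in range(1, 89) if piano_key_frequency(k) < limit]

print(limit)            # 3150.0
print(len(keys_below))  # 83
```

Under these assumptions, the fundamentals of all but the top five keys lie below the new Nyquist frequency of 3.15 kHz, while overtones above that frequency are lost.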

For my dataset, I chose two piano composers of different style, J.S. Bach and F. Schubert. Here is a 10-sec piece by Bach:

And here downsampled to 6.3 kHz:

One can still perceive the structure. For this project, I was mainly concerned with larger musical structures, not the overtone structure, so I decided that this is roughly the compromise I wanted to make. Now, let’s compress the dynamic range from the standard 16 bit to 8 bit:

This still seems acceptable, although one problem is apparent: the encoding of background silence is not very good, and it sounds as if some white noise had been added across all frequencies.

Now, let’s have a look at Schubert, first the unperturbed original:

Then downsampled to 6.3 kHz:

And with reduced dynamic range (8 bit):

Now, it becomes clear that 8 bit is not enough if the dynamic range that is used is large – which is definitely more the case for the typically erratic Schubert sonatas than for Bach prelude recordings. Clarity is difficult to encode properly!

I therefore chose to downsample in time by 7, but keep 16 bit in dynamic range. Deepmind (see above) showed how to compress audio to 8 bit without these losses, by using a non-linear mapping of amplitudes from raw data to 8 bit.
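A common choice for such a non-linear mapping is mu-law companding, which spends more of the 256 levels on quiet amplitudes. The sketch below is an illustration under that assumption (mu = 255, samples scaled to [-1, 1]) and is not code from Deepmind or from my project; all function names are made up.

```python
import math

MU = 255  # companding constant, giving 256 output levels


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer code in 0..mu."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))


def mu_law_decode(code, mu=MU):
    """Invert mu_law_encode up to quantisation error."""
    compressed = 2 * code / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


def linear_8bit(x):
    """Linear quantisation to signed 8 bit and back, for comparison."""
    return round(x * 127) / 127


quiet = 0.001  # a very soft sample, e.g. a decaying note
quiet_linear = linear_8bit(quiet)
quiet_mu_law = mu_law_decode(mu_law_encode(quiet))

print(quiet_linear)  # the soft sample collapses to zero
print(quiet_mu_law)  # close to 0.001
```

With linear 8-bit quantisation the soft sample is rounded away completely, whereas the non-linear mapping keeps it to within a few percent, at the price of coarser steps for loud samples.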


Now some details on how I generated the dataset; there are most likely better ways, but here is my improvised solution. I had a couple of folders with relevant mp3s, either Bach or Schubert. In Matlab, I opened each mp3, downsampled the time course by 7, chunked it into 10-sec pieces and saved the result as one binary mat file per mp3 file. The 10-sec pieces are not independent: consecutive pieces are shifted by 1 sec and therefore overlap by 9 sec, in order to generate more training data:


And here is the code:

% list of folders (entries 1 and 2 are '.' and '..')
FolderList = dir();

counter_file = 1;
for jj = 3:9 % here, I had 7 folders with different composers
    % enter the folder (matches the 'cd ..' at the end of the loop)
    cd(FolderList(jj).name)
    % list of mp3s
    FileList = dir('*.mp3');

    bitelength = 6300; % 1.0 sec at 6.3 kHz; one chunk will be 10 sec

    chunk_counter = 0;
    for i = 1:numel(FileList)
        % the end of this call was cut off in the listing; reconstructed
        disp(strcat('mp3 #',32,num2str(i),32,'out of',32, ...
            num2str(numel(FileList)),32,'within folder "', ...
            FolderList(jj).name,'"'));
        % read mp3
        [y,Fs] = audioread(FileList(i).name);
        % downsample by factor 7 (average over blocks of 7 samples)
        yy = zeros(floor(size(y,1)/7),size(y,2));
        for k = 1:7
            yy = yy + y((7:7:end)-k+1,:);
        end
        yy = yy/7;

        % expected number of 10 sec chunks, shifted by 1 sec each
        num_chunks = max(floor(size(yy,1)/bitelength)-9,0);

        % generate chunked pieces (first channel only)
        chunked_song = zeros(bitelength*10,num_chunks,'int16');
        for p = 1:num_chunks
            piece = yy((1:bitelength*10)+bitelength*(p-1),1);
            % 16bit, therefore multiply by 2^15-1=32767 for signed integer
            piece = piece*32767;
            chunked_song(:,p) = int16(piece);
        end
        % keep track of number of chunks for this folder
        chunk_counter = chunk_counter + num_chunks;
        % save as mat file (can be read in Python and Matlab);
        % the save call was missing here, so the file name is an assumption
        [~,basename] = fileparts(FileList(i).name);
        save(strcat(basename,'.mat'),'chunked_song');
    end
    % go back
    cd ..
    % collect metadata about the number of chunks contained
    % in the respective folder
    Chunk_Counter(counter_file).filename = FolderList(jj).name;
    Chunk_Counter(counter_file).chun_counter = chunk_counter;
    counter_file = counter_file + 1;
end
% write the metadata to a mat file (file name is an assumption)
save('Chunk_Counter.mat','Chunk_Counter');
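
For readers who do not use Matlab, the three core steps of the listing above (block-averaging by 7, cutting into 10-sec chunks shifted by 1 sec, and scaling to 16-bit integers) can be sketched in plain Python. This is an illustrative re-implementation with made-up function names, not the script from the repository, and it works on plain lists instead of audio files.

```python
def block_average(samples, factor):
    """Downsample by averaging non-overlapping blocks of `factor` samples."""
    n_out = len(samples) // factor
    return [sum(samples[i * factor:(i + 1) * factor]) / factor for i in range(n_out)]


def chunk_starts(n_samples, hop, hops_per_chunk=10):
    """Start indices of chunks of length hop * hops_per_chunk, shifted by `hop`."""
    n_chunks = max(n_samples // hop - (hops_per_chunk - 1), 0)
    return [p * hop for p in range(n_chunks)]


def to_int16(x):
    """Scale a float in [-1, 1] to the signed 16-bit range, with clipping."""
    return max(-32768, min(32767, int(round(x * 32767))))


HOP = 6300                               # 1 sec at 6.3 kHz
starts = chunk_starts(30 * HOP, HOP)     # a 30-sec recording after downsampling
overlap_sec = (10 * HOP - (starts[1] - starts[0])) / HOP

print(block_average([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [2.5, 6.5]
print(len(starts))                                 # 21
print(overlap_sec)                                 # 9.0
print(to_int16(1.0), to_int16(-1.0))               # 32767 -32767
```

A 30-sec recording thus yields 21 chunks instead of 3 non-overlapping ones, which is how the shifted windows generate more training data from the same recordings.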


For training, I wrote a Python script to read in the dataset. Together with the Matlab code, you can find it on GitHub.

The dataset itself consisted of a complete recording of the piano sonatas by Schubert; the Goldberg variations by Bach, played by two different pianists; some fugues and preludes by Bach; and, again by Bach, some partitas and various other recordings. In compressed form, it is about 5 GB in size.

The deep learning classifier I trained was supposed to learn to assign each 10-sec fragment either to Bach or Schubert; after learning, I would let the network imagine its own music based on its internal structure. However, the network never learned to classify properly, so I had to give up the project.


8 Responses to Deep learning, part IV: Deep dreams of music, based on dilated causal convolutions

  4. Peter Leupi says:


    Great article.

    Would you be willing to share your piano dataset? (If possible, the one split into 10-second pieces.)
    It would save me a ton of time, trying to recreate the wavenet results. :)


  5. Hi Peter,
    thanks for your comment! Of course I would be willing to share the dataset, if it can be of help to anybody. The only issue is the size of the dataset – maybe around 10 GB (compressed). But I can certainly send you the code that I used to generate the dataset from mp3 files. I will look into all of this during the weekend and post an update here.
