Fast WaveNet Decoder for NSynth?


Parag Mital

Apr 11, 2017, 3:39:14 PM
to magenta...@tensorflow.org
Dear all,

NSynth is really incredible work!  I can see it starting a whole line of interesting developments in audio synthesis.  Great job to everyone involved... I am really excited by what it can do.

I was hoping to get some decodings going.  Has anyone tried using a fast implementation of the wavenet decoder for sampling w/ NSynth?

The original decoder:

Example implementation of a fast decoder: 

I'll try looking into it myself, but I'm curious whether anyone has already started or has suggestions.

Regards,
Parag

--

Parag K. Mital, Ph.D. / Director of Machine Intelligence
pa...@kadenze.com

Kadenze, Inc. Office: (661) 367-1361
27200 Tourney Rd / Ste. 350 / Valencia, CA 91355

Kadenze and Kannu are trademarks of Kadenze, Inc.

Parag Mital

Apr 11, 2017, 3:51:12 PM
to magenta...@tensorflow.org
The conditioning part seems like it would need to be reworked.  Say a 4-second sample at 16,000 Hz has a 128 x 16 latent embedding from the Non-Causal Temporal Encoder.  The WaveNet decoder should then condition on the appropriate slice of this temporal encoding, for instance by upsampling it to the length of the decoded audio, e.g. 64000 x 16...

Some relevant text from the blog article:

We condition the vanilla WaveNet decoder with this embedding by upsampling it to the original time resolution, applying a 1x1 convolution, and finally adding this result as a bias to each of the decoder’s thirty layers. Note that this conditioning is not external as it’s learned by the model. Since the embeddings bias the autoregressive system, we can imagine it acting as a driving function for a nonlinear oscillator. This interpretation is corroborated by the fact that the magnitude contours of the embeddings mimic those of the audio itself.
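
For concreteness, here is a minimal NumPy sketch of that conditioning path (not the actual NSynth code): nearest-neighbor upsampling of a 128 x 16 embedding to 64,000 samples, followed by a stand-in for the learned 1x1 convolution. The random projection matrix and the 512-channel decoder width are assumptions for illustration only.

# Minimal sketch (NumPy only), assuming a 4 s / 16 kHz clip and a 128 x 16
# encoder embedding. The real decoder applies a *learned* 1x1 convolution and
# adds the result as a bias in each of its layers; `proj` below is a random
# stand-in for that learned projection.
import numpy as np

sample_rate = 16000
seconds = 4
n_samples = sample_rate * seconds              # 64000
embedding = np.random.randn(128, 16)           # (encoding frames, channels)

hop = n_samples // embedding.shape[0]          # 500 samples per encoding frame
upsampled = np.repeat(embedding, hop, axis=0)  # (64000, 16), nearest neighbor

proj = np.random.randn(16, 512)                # assumed decoder channel width
bias = upsampled @ proj                        # (64000, 512), added as a per-layer bias
print(bias.shape)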

Douglas Eck

Apr 11, 2017, 6:59:02 PM
to Magenta Discuss
Nice to see a good open-source decoder. I'll be curious to see how fast it is once we figure out how to connect it. 

Jesse Engel

Apr 11, 2017, 8:14:20 PM
to Magenta Discuss
That would be awesome. You might want to connect with the authors of the fast sampler (https://github.com/ibab/tensorflow-wavenet/issues/254). FWIW, the code already does the upsampling for you in _condition() (https://github.com/tensorflow/magenta/blob/master/magenta/models/nsynth/wavenet/h512_bo16.py#L64). The nearest-neighbor upsampling happens implicitly through the reshape and broadcast in the addition. In one approach, you could start with a latent vector, upsample it to the fine-scale resolution by adding it to a bunch of zeros, and then condition the generation sample by sample on that vector at the same resolution.
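
To illustrate the reshape-and-broadcast trick described above, here is a minimal NumPy sketch (not the actual _condition() implementation); the tensor shapes are assumed for illustration.

# Conditioning via reshape + broadcast, assuming (64000, 512) decoder
# activations and a (128, 512) embedding that has already been through
# the 1x1 convolution.
import numpy as np

x = np.random.randn(64000, 512)     # per-sample decoder-layer activations
cond = np.random.randn(128, 512)    # temporal embedding after the 1x1 conv

frames, channels = cond.shape
hop = x.shape[0] // frames          # 500 samples per embedding frame

x = x.reshape(frames, hop, channels)
x = x + cond[:, None, :]            # broadcast over hop = nearest-neighbor upsample
x = x.reshape(frames * hop, channels)
print(x.shape)                      # (64000, 512)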

Leonardo O. Gabrielli

Apr 12, 2017, 5:07:54 AM
to Jesse Engel, Magenta Discuss
Hi,
I may be a bit out of date; I started working on this research branch in 2016 when the first WaveNet paper came out, and later left it for SampleRNN, which gives similar results at much lower computational cost (with pros and cons).
There are already a large number of WaveNet decoder implementations, but I just can't see how people are going to handle all the computational power it needs. Unlike the earlier Magenta work based on MIDI, here you need a lot of resources. We barely generated some flimsy sounds after days of training on a Titan X GPU.
What are other people's experiences? Are you able to generate sounds? Are recent implementations faster?

Nonetheless, I think the Magenta project got it right: entangled generation is much more significant for music research than MIDI work, and I hope to see cool developments on this front. I'm writing up my opinion on this in a Computer Music Journal letter to appear in the coming months.

Best regards


Jesse Engel

Apr 12, 2017, 11:40:39 AM
to Magenta Discuss
A difference in this case is that we've already done the expensive training for you. With a single TitanX and a proper sampling algorithm, the model we released should generate around two seconds of audio every minute (batch size 16).
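
As a rough back-of-the-envelope reading of that figure (assumed arithmetic, not a measured benchmark):

# 2 s of 16 kHz audio per minute of wall time, at batch size 16.
sample_rate = 16000
audio_seconds_per_minute = 2.0
per_item_rate = sample_rate * audio_seconds_per_minute / 60.0  # ~533 samples/s per batch item
realtime_factor = 60.0 / audio_seconds_per_minute               # ~30x slower than real time
batch_throughput = per_item_rate * 16                           # ~8533 samples/s across the batch
print(per_item_rate, realtime_factor, batch_throughput)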


Leonardo O. Gabrielli

Apr 12, 2017, 1:29:21 PM
to Jesse Engel, Magenta Discuss
Thank you Jesse, that's good to know. However, I'm more on the machine learning side of this, so I'd be more interested in training the network on our own datasets for specific purposes, or in modifying the architecture, inputs, and so on. Still a significant contribution to the community, thanks.
