Using Magenta nsynth as a feature extractor for audio distance comparisons - thoughts?


jeffre...@flatironsdigital.com

Feb 20, 2018, 7:10:39 PM
to Magenta Discuss
Hey All, 

Have been using Audioset for a while and just discovered Magenta. Very cool and thanks for open sourcing!

Question: I have tens of thousands of tracks. I'd characterize them as 15-second song riffs. I'd like to build a similarity engine so that I can select a song, then find songs that sound alike.

Any thoughts on the following:

Extract features from my files using something like this (found on the net):

import os
import glob

from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen


def wavenet_encode(file_path):
    checkpoint_path = './wavenet-ckpt/model.ckpt-200000'

    # Load and downsample the audio.
    neural_sample_rate = 16000
    audio = utils.load_audio(file_path,
                             sample_length=400000,
                             sr=neural_sample_rate)

    # Pass the audio through the first half of the autoencoder,
    # to get a list of latent variables that describe the sound.
    # Note that it would be quicker to pass a batch of audio
    # to fastgen.
    encoding = fastgen.encode(audio, checkpoint_path, len(audio))
    return encoding


Then pass the features through an autoencoder to reduce their dimensionality down to 40-80d vectors, and use Euclidean or cosine distance to compute similarity.
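A minimal sketch of that second stage, using scikit-learn's PCA as a simpler stand-in for the autoencoder (the `encodings` list and the 64-d target are assumptions, not a recommendation):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Assumed: `encodings` is a list of nsynth encodings of identical
# shape (1, T, 16), one per track (all tracks same audio length).
X = np.stack([e.reshape(-1) for e in encodings])  # (n_tracks, T * 16)

# PCA stands in here for the autoencoder; either way the goal is
# one small dense vector per track.
vectors = PCA(n_components=64).fit_transform(X)

# Cosine similarity between every pair of tracks; sorting row i
# descending gives the tracks most similar to track i.
sim = cosine_similarity(vectors)
top5_for_track0 = np.argsort(-sim[0])[1:6]  # skip the track itself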




Jesse Engel

Feb 20, 2018, 8:55:55 PM
to jeffre...@flatironsdigital.com, Magenta Discuss
Hi Jeffrey and/or James,

You can indeed use the nsynth embeddings for similarity measures, but your mileage may vary based on how well the model fits the data (it was trained on single instrument notes). I've done some qualitative experiments myself and found it picked similar instruments, but haven't done any quantitative comparisons to a baseline like CQT. Here's a post from someone who tried this with t-SNE for some sounds: https://medium.com/@LeonFedden/comparative-audio-analysis-with-wavenet-mfccs-umap-t-sne-and-pca-cb8237bfce2f
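The qualitative check in that post boils down to something like this (a sketch; `vectors` is assumed to be one fixed-length feature row per sound):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed: `vectors` is an (n_sounds, d) array of per-sound features
# (pooled nsynth encodings, MFCCs, etc.).
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)

# Eyeball whether similar-sounding tracks land near each other.
plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.show()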

Best,
Jesse


Jeffrey James

Feb 20, 2018, 9:04:32 PM
to Jesse Engel, Magenta Discuss
Thanks Jesse, 

Ah, good point regarding the original domain being music notes. 

Re: CQT, you're referring to the constant-Q transform as another "simple approach" to test against? I tend to see MFCCs thrown around everywhere in the literature as a feature, but hardly ever come across the CQT.

best,

Jeff




--
Jeffrey R. James
Managing Partner, Flatirons Digital
631-291-6848 | skype: jeffjames83


Jesse Engel

Feb 20, 2018, 9:10:30 PM
to jeffre...@flatironsdigital.com, Magenta Discuss
BTW, this is an area of possible future research, so I'd be really interested in hearing what you find :).


Jesse Engel

Feb 20, 2018, 9:12:08 PM
to jeffre...@flatironsdigital.com, Magenta Discuss
Yah, I'm referring to the constant-Q transform; any spectral technique, really, is a good baseline. MFCCs are more typical because they capture some more invariances.
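As a sketch, that baseline is a couple of librosa calls ('riff.wav' is a placeholder; parameters are just defaults):

import numpy as np
import librosa

# Load a track and compute a constant-Q spectrogram.
y, sr = librosa.load('riff.wav', sr=16000)
cqt_db = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)))  # (n_bins, n_frames)

# MFCCs from the same audio, for comparison.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames)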

Jeffrey James

Feb 20, 2018, 9:20:24 PM
to Jesse Engel, Magenta Discuss
Right on. Will be happy to share some general observations after I crank through my dataset. 

Understood re: spectral techs in general. What seems to be the common way to reduce the dimensionality of the MFCC or CQT? 

Typically the MFCC data are 13 (or 20) coefficients x (100 * seconds) frames, which is way too high-dimensional for distance measures.

I'm guessing just taking the mean row-wise gets you a 13-d vector, and then some folks also stack the 1st and 2nd order diffs on 'em, something like the sketch below (papers are taxing to read as a hacker!).
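(i.e., an untested sketch of that pooling with librosa; 'riff.wav' is a placeholder:)

import numpy as np
import librosa

y, sr = librosa.load('riff.wav', sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)

# First- and second-order differences along the time axis.
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Mean over time collapses each (13, n_frames) block to 13 numbers;
# stacking all three gives one 39-d vector per track.
vector = np.concatenate([mfcc.mean(axis=1), d1.mean(axis=1), d2.mean(axis=1)])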

thx, 

Jeff

Jesse Engel

Feb 21, 2018, 1:20:17 PM
to Jeffrey James, Magenta Discuss
I think typically you would break it into smaller chunks, as no algorithm will be incorporating very long-term structure anyway. One other thing people do, rather than just comparing all the time bins directly, is to compare the best possible alignment of the time bins to each other with something called dynamic time warping (https://librosa.github.io/librosa/generated/librosa.core.dtw.html).
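A sketch of that comparison (librosa's dtw lives at librosa.sequence.dtw in current releases; the two MFCC matrices are assumed):

import librosa

# Assumed: mfcc_a, mfcc_b are (n_mfcc, n_frames) matrices for two
# tracks, possibly of different lengths. dtw finds the cheapest
# alignment between their time bins.
D, wp = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric='cosine')

# Total cost of the best alignment, normalized by path length so
# longer tracks aren't penalized. Lower means more similar.
cost = D[-1, -1] / len(wp)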

Best of luck,
Jesse

Jeffrey James

Feb 22, 2018, 10:20:48 AM
to Jesse Engel, Magenta Discuss
Beautiful, already getting started with the MFCCs + first- and second-order diffs across a few thousand tracks. Somewhat promising similarities thus far using nearest neighbors (at least for strings, guitar, and orchestral pieces). Gotta find a way to benchmark concretely, then I'll whip it up with nsynth!
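A sketch of that lookup with scikit-learn (the `vectors` matrix of per-track features is assumed, built as above):

from sklearn.neighbors import NearestNeighbors

# Assumed: `vectors` is (n_tracks, 39), one row of MFCC + delta
# features per track.
nn = NearestNeighbors(n_neighbors=6, metric='cosine').fit(vectors)

# Query with track 0; the first hit is the track itself, so skip it.
distances, indices = nn.kneighbors(vectors[0:1])
matches = indices[0][1:]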

Jeff
