Training a classifier using VGGish embeddings

Sleeba Paul

Aug 28, 2018, 7:46:56 AM

to audioset-users

Hi Group,

I would like to request help from the community. Here is my problem.

Say I have to classify an input audio buffer into one of three categories:

1. Explosion

2. Gunfire

3. Others

All these categories are available in AudioSet. I've fetched the data and preprocessed it into 2-second clips at a 16 kHz sampling rate. These clips are passed to VGGish to generate embeddings of shape (2, 128).

I would like to train another lightweight classifier on these embeddings to predict the labels above. What should the architecture of this lightweight network be? I'm interested in using MobileNet / SqueezeNet for on-premise prediction, but I'm worried about the input format, since I can't treat a (2x128) array as an image.

Please help!

Warm regards,

Sleeba Paul

Wei Fan

Sep 5, 2018, 10:54:00 AM

to audioset-users

The VGGish model takes a clip of about 1 second and generates a 128-d embedding. Your clip is 2 seconds long, hence you get a 2x128 embedding, with each 128-d vector representing 1 second of the audio content. You can pass the two 128-d vectors one by one to the lightweight classifier and find a way to ensemble the two outputs. For example, if either one is classified as explosion/gunshot, consider the full clip to be explosion/gunshot. The reason for doing this is that explosions and gunshots are generally short-lived sounds.
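The "any segment fires" rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the label strings, the `combine_clip_prediction` helper, and the idea that the classifier returns one label per 1-second embedding are all hypothetical and not part of VGGish.

```python
from typing import Sequence

# Hypothetical label names for the three classes discussed in this thread.
TARGET_LABELS = ("explosion", "gunfire")
OTHER_LABEL = "others"


def combine_clip_prediction(segment_labels: Sequence[str]) -> str:
    """Combine per-second labels into a single clip-level label.

    If any segment is labelled as a target class, the whole clip gets that
    label. When several target labels occur, the first one wins; that
    tie-break is an assumption of this sketch, not something from the thread.
    """
    for label in segment_labels:
        if label in TARGET_LABELS:
            return label
    return OTHER_LABEL


# One label per 1-second VGGish embedding of a 2-second clip.
print(combine_clip_prediction(["others", "gunfire"]))
print(combine_clip_prediction(["others", "others"]))
```

If the classifier outputs probabilities rather than hard labels, taking the per-class maximum over the segments would be a comparable alternative.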

Regarding the architecture of the lightweight classifier on top of the embeddings, I haven't explored this much. From what I have tried, even a single-layer perceptron works well. A random forest (RF) works too. I'd appreciate it if anyone who knows a better architecture for the lightweight classifier could share it.
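As a rough idea of how small such a classifier can be, here is a minimal sketch of a single linear layer with softmax, trained on one 128-d vector at a time. Everything in it is an assumption for illustration: the `train_softmax` and `predict` helpers, the hyperparameters, and especially the data, which is random synthetic clusters standing in for real VGGish embeddings.

```python
import numpy as np


def train_softmax(X, y, num_classes, lr=0.01, epochs=200, seed=0):
    """Train one linear layer + softmax with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets, shape (n, num_classes)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the index of the highest-scoring class for each row of X."""
    return np.argmax(X @ W + b, axis=1)


# Synthetic stand-in data: three well-separated clusters in 128 dimensions.
# Real inputs would be the 128-d VGGish embeddings, one per second of audio.
rng = np.random.default_rng(0)
num_classes, dim = 3, 128
centers = rng.normal(size=(num_classes, dim))
y = rng.integers(0, num_classes, size=300)
X = centers[y] + 0.1 * rng.normal(size=(300, dim))

W, b = train_softmax(X, y, num_classes)
accuracy = float((predict(X, W, b) == y).mean())
print(f"training accuracy on synthetic data: {accuracy:.2f}")
```

The accuracy printed here says nothing about real audio; it only confirms that the training loop runs. On real embeddings, a held-out validation split would be needed to compare this against a random forest or a deeper network.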

Thanks,

Wei

Carlos Miguel Quintos

Sep 11, 2018, 5:50:40 AM

to audioset-users

Hi Sleeba,

How did you fetch the specific data (Explosion, Gunfire, etc.) from AudioSet?

Sleeba Paul

Sep 11, 2018, 12:12:17 PM

to audioset-users

Okay, I had no clue about extracting specific sounds from AudioSet. Then I found this Kaggle competition.