VGGish model released

Manoj Plakal

Aug 8, 2017, 5:26:16 PM
to audioset-users

The Sound Understanding team at Google is happy to announce that the trained model used to generate AudioSet embeddings is now available.

The model, which we call "VGGish", is available at https://github.com/tensorflow/models/tree/master/audioset

This release contains:
- the VGGish model definition in TensorFlow (Slim)
- Python code to compute log mel spectrogram features from waveform
- Python code to post-process the embeddings from the model and apply PCA/quantization
- associated model checkpoint and PCA parameter files
- demo code showing how to use the model in inference and training modes
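To give a rough sense of how these pieces fit together in inference mode, here is a minimal sketch (the module and function names below reflect the files in the repo at release time and should be treated as illustrative; vggish_inference_demo.py in the repo is the authoritative, runnable example):

import tensorflow as tf

import vggish_input
import vggish_params
import vggish_postprocess
import vggish_slim

# Waveform -> batch of log mel spectrogram patches, shape [num_examples, 96, 64].
examples = vggish_input.wavfile_to_examples('some_audio.wav')  # placeholder path

with tf.Graph().as_default(), tf.Session() as sess:
  # Define VGGish in inference mode and load the released checkpoint.
  vggish_slim.define_vggish_slim(training=False)
  vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
  features = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
  embedding = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)

  # One raw 128-D embedding per 0.96 s input patch.
  [raw_embeddings] = sess.run([embedding], feed_dict={features: examples})

# Apply the same PCA + quantization used for the released AudioSet embeddings.
pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')
postprocessed = pproc.postprocess(raw_embeddings)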

As mentioned in the README, please use the mailing list for general questions, and use the tensorflow/models issue tracker for specific technical issues (and make sure to @-mention or assign issues to @plakal and @dpwe to get our attention).

We are looking forward to seeing how the community will use VGGish and AudioSet!

Manoj,
on behalf of Sound Understanding @ Google

yu-cha...@t-online.de

Sep 18, 2017, 6:37:32 PM
to audioset-users
Hi Manoj,

I have a question about the hyperparameters used for feature extraction. Could you tell me how you selected the values for the following parameters? Was it just from experience?
NUM_FRAMES = 96  # Frames in input mel-spectrogram patch.
NUM_BANDS = 64  # Frequency bands in input mel-spectrogram patch.
STFT_WINDOW_LENGTH_SECONDS = 0.025
STFT_HOP_LENGTH_SECONDS = 0.010

Best Regards
Yu Changsong

Manoj Plakal

Sep 21, 2017, 4:05:34 PM
to audioset-users

FYI, the tensorflow/models GitHub repository was just reorganized and a number of models, including AudioSet, were moved into a 'research' folder.

The AudioSet model code can now be found here:
https://github.com/tensorflow/models/tree/master/research/audioset

Dan Ellis

Sep 21, 2017, 4:13:33 PM
to yu-cha...@t-online.de, audioset-users
Yu - 

The 25 ms window / 10 ms hop is inherited from speech recognition (meaning it was optimized for speech spectra and phoneme durations, which are not particularly relevant to audio events, but it has worked out OK).

Using 64 mel bands instead of the more customary 40 was basically to get a power of 2 which is a little "cleaner" for the factor-2 downsampling in the CNN.  More spectral resolution seems useful, but with the normal mel spectrum implementation you don't want to go too fine (and risk aliasing against the FFT bins).

We chose a ~1 sec window somewhat arbitrarily: Originally, we were using ~200 ms input patches, but wanted a more generous time context to be able to make use of wider time structure.  But there are diminishing returns for very large windows.  We went with 96 frames of 10 ms rather than exactly 100 so we could decimate by 2 five times and still get an integer size.
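To spell out that arithmetic (plain Python, purely illustrative):

print(96 * 0.010)   # 0.96 seconds of context per patch

# How many times can each candidate size be halved and stay an integer?
for n in (96, 100, 64):
    size, halvings = n, 0
    while size % 2 == 0:
        size //= 2
        halvings += 1
    print(n, halvings)   # 96 -> 5, 100 -> 2, 64 -> 6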

I would say we've informally investigated each of these choices without finding anything that makes a startling difference, but it's quite possible there's something we're missing.  I'd be very glad to see more systematic and quantitative investigation.

  DAn.



Peter Trajmar

Nov 9, 2017, 8:09:19 PM
to audioset-users
I have a question somewhat related to this topic.

We see from vggish_params.py:
SAMPLE_RATE = 16000
STFT_WINDOW_LENGTH_SECONDS = 0.025
STFT_HOP_LENGTH_SECONDS = 0.010

This works out to:
window_length_samples:  400
hop_length_samples:  160

And fft_length is calculated as:
fft_length = 2 ** int(np.ceil(np.log(window_length_samples) / np.log(2.0)))
fft_length:  512

So a 512-length FFT (np.fft.rfft()) is applied to a data array of size 400. My understanding is that fft_length should match the size of the data being passed. It is not clear to me how np.fft.rfft() behaves when passed fewer samples than fft_size.

I'm generally hoping someone can offer some insight here.

Just to make my confusion more concrete, here are some questions that may help me understand this:
Is this the desired behavior (fft applied to data array smaller than fft_size)?
Why not make the window size a power of 2?
Is it conceptually okay to execute an fft with input size smaller than fft_size? 
What is the behavior of np.fft.rfft() when passed fewer samples than fft_size?

I would greatly appreciate any assistance with this.

Thanks,
Peter



Manoj Plakal

Nov 9, 2017, 8:45:20 PM
to Peter Trajmar, Dan Ellis, audioset-users

DAn can explain this better than anyone.



Dan Ellis

Nov 10, 2017, 7:22:43 AM
to Peter Trajmar, audioset-users
On Thu, Nov 9, 2017 at 8:09 PM, Peter Trajmar <ptra...@gmail.com> wrote:

This works out to:
window_length_samples:  400
hop_length_samples:  160

And fft_length is calculated as:
fft_length = 2 ** int(np.ceil(np.log(window_length_samples) / np.log(2.0)))
fft_length:  512

So a 512-length FFT (np.fft.rfft()) is applied to a data array of size 400. My understanding is that fft_length should match the size of the data being passed. It is not clear to me how np.fft.rfft() behaves when passed fewer samples than fft_size.

The vector is padded out to the FFT length with zeros.  If we were using the phase of the transform, it would matter at which end the zeros were added, but we're not, so it doesn't.
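If it helps, here's a tiny numpy illustration of that point (not the actual feature code):

import numpy as np

np.random.seed(0)
frame = np.random.randn(400)   # one 25 ms window at 16 kHz
pad = np.zeros(112)            # 512 - 400 zeros

# Zeros at the end vs. zeros at the front: the phases differ,
# but the magnitude spectra are identical, and only the
# magnitudes feed the mel/log stages.
print(np.allclose(np.abs(np.fft.rfft(np.hstack([frame, pad]))),
                  np.abs(np.fft.rfft(np.hstack([pad, frame])))))   # True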
 

Just to make my confusion more concrete, here are some questions that may help me understand this:
Is this the desired behavior (fft applied to data array smaller than fft_size)?

Yes 

Why not make the window size a power of 2?
 
We want a 25 ms window.  The reasons for this aren't particularly rigid; we inherited the convention from speech recognition, where you want a window short enough to capture local variation in the signal, yet long enough to smooth over rapid fluctuation.  25 ms is a good compromise because it is long enough to smooth across the pitch pulses of typical voiced speech.  But it has also worked well, empirically, in a wide range of audio recognition applications.

The time window does determine the characteristics of the feature.  You don't particularly want it to be too closely tied to the sampling rate.  One nice thing about using a common window duration is that you can, in fact, calculate a comparable mel spectrum (e.g., the 64-band spectrum we actually use) for audio with different sampling rates (e.g., 11025, 16000, 22050 Hz) without necessarily resampling everything to the same rate first.  We don't do that, but since we're going to remap the spectrum's frequency axis anyway, there's nothing particularly special about using a 2^N window size.
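Purely to illustrate (parameter arithmetic only, not our actual feature code): with the window and hop fixed in seconds, the sample counts and FFT size track the sampling rate while the mel axis stays at 64 bands.

import numpy as np

for sample_rate in (11025, 16000, 22050):
    window = int(round(0.025 * sample_rate))   # STFT_WINDOW_LENGTH_SECONDS
    hop = int(round(0.010 * sample_rate))      # STFT_HOP_LENGTH_SECONDS
    fft_length = 2 ** int(np.ceil(np.log2(window)))
    # The mel spectrum is always mapped onto 64 bands, so features
    # computed at these rates stay broadly comparable.
    print(sample_rate, window, hop, fft_length, 64)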

Is it conceptually okay to execute an fft with input size smaller than fft_size? 

Yes.  Zero padding is exactly equivalent to interpolation of the 400 point DFT of a 400 point time sequence to 512 points.
 
What is the behavior of np.fft.rfft() when passed fewer samples than fft_size?

np.fft.rfft(v, fft_size) == np.fft.rfft(np.hstack([v, np.zeros(fft_size - len(v))]), fft_size)
for a 1D vector v whose length is <= fft_size.
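And a quick numerical check of that identity (illustrative only):

import numpy as np

np.random.seed(0)
v = np.random.randn(400)
fft_size = 512

explicit_pad = np.hstack([v, np.zeros(fft_size - len(v))])
print(np.allclose(np.fft.rfft(v, fft_size),
                  np.fft.rfft(explicit_pad, fft_size)))   # True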

I hope this makes it clear.

  DAn.


Peter Trajmar

Nov 10, 2017, 6:18:58 PM
to audioset-users
DAn,

Thank you very kindly for your thorough responses. I believe you have answered all of my questions and I have a better understanding now.

Peter

Hiro

Dec 2, 2017, 7:31:15 PM
to audioset-users
Is there any way we can get the raw audio files that are labeled? I'm trying to develop my own model and want to use librosa to extract the features from the raw sound files. 

thanks,
Hiro

alex....@bostonfusion.com

Dec 3, 2017, 4:01:51 AM
to audioset-users
Hi,

When I run this model on the AudioSet clips, I get very different numbers. 

In particular, I downloaded this clip of a helicopter: https://www.youtube.com/watch?v=bq6C0_tAbJM&feature=youtu.be&start=30&end=40, truncating it to the segment between 30 and 40 seconds. I downloaded it as mp3, converted it to wav, and then ran this command:

python vggish_inference_demo.py --wav_file helicopter.wav --tfrecord_file helicopter.tfrecord --pca_params vggish_pca_params.npz --checkpoint vggish_model.ckpt


The first element of the resulting file is 0.45, which is consistent with the other helicopter videos I downloaded and ran through the same process. But the first element of the AudioSet-provided file is -1.58, which is consistent with the other AudioSet-provided helicopters. Am I missing something?

Thanks,

Alex

Abhilash Iyer

Jul 12, 2019, 11:00:05 AM
to audioset-users
Hello everybody, 

I am using the VGGish model to detect sounds and would eventually like to deploy the inference pipeline to an edge device. The VGGish checkpoint file is ~291 MB, and the additional low-level classifier adds about 2 MB to my pipeline. This is too big.

1. I am looking for any links or threads where people have experimented with quantizing the VGGish model. Any ideas on how to compress the model?

2. Also, can anyone share the architecture of VGGish?

Thanks