VGGish vs. Speech Commands dataset for sound recognition

Nikolay Starikov

Oct 2, 2017, 9:48:50 AM
to audioset-users
Dear colleagues,

First of all thank you very much for your research in audio/sound recognition!

Could you clarify the following question for me? I see two streams for sound recognition in the TensorFlow GitHub repository:

1. A simpler approach using the Speech Commands dataset, based on the "Convolutional Neural Networks for Small-footprint Keyword Spotting" paper (a spectrogram is used as input).

2. A more complex approach using the VGGish CNN model. A log mel spectrogram is used as input (without MFCCs, I guess).

I'd like to train and test on my own dataset in order to recognize car makes by their engine/motor sounds.

Could you please advise me which approach best suits this task? Will Google continue to develop the simple sound models, or will it concentrate on the large-scale one, or maybe on CRNNs?


Best regards,

Nikolay Starikov

Dan Ellis

Oct 2, 2017, 12:24:10 PM
to Nikolay Starikov, audioset-users
Nikolay - 

The two models you mention come from different target applications.  Keyword spotting is concerned with speech recognition, where the network is trained to recognize a single spoken keyword or phrase over a wide range of speakers, speaking styles, and background conditions.  The CNN in that work is derived from the ones commonly used in speech recognition, with relatively few convolutional layers (e.g. 1 or 2) and relatively large convolution kernel size (e.g. 20x8 time-frequency cells). 
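For concreteness, here is a rough Keras-style sketch of that kind of small-footprint architecture. The input shape, filter counts, and kernel sizes are illustrative guesses on my part, not the exact published model:

import tensorflow as tf

# Sketch of a KWS-style CNN: a spectrogram/log-mel patch in, a couple of
# large-kernel convolutions, then a small classifier on top.
def build_kws_style_cnn(num_classes=12, frames=98, bands=32):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(frames, bands, 1)),
        tf.keras.layers.Conv2D(64, kernel_size=(20, 8), activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(1, 3)),
        tf.keras.layers.Conv2D(64, kernel_size=(10, 4), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])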

The VGGish model is aimed at generic sound recognition, thus not specialized for speech or phoneme sequences. Like the KWS model, it uses a log-amplitude mel-frequency spectrogram as input, although with greater frequency resolution (64 not 32 bands).  The VGGish model is inspired by work in image recognition, and uses a larger number (e.g. 4) of narrower (e.g. 3x3) convolutional layers.  Unlike the KWS model, it has not been particularly optimized for computational efficiency, although this would be a natural thing to investigate.
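By contrast, a VGG-style stack looks something like the sketch below: many small 3x3 convolutions with pooling, feeding a fixed-size embedding. Again, this only mirrors the general shape of VGGish; the layer counts and widths here are placeholders, not the released checkpoint:

import tensorflow as tf

# Sketch of a VGG-style audio CNN on a 96-frame x 64-band log-mel patch.
def build_vggish_style_cnn(embedding_size=128, frames=96, bands=64):
    model = tf.keras.Sequential([tf.keras.Input(shape=(frames, bands, 1))])
    for filters in (64, 128, 256, 512):
        model.add(tf.keras.layers.Conv2D(filters, (3, 3), padding='same',
                                         activation='relu'))
        model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(embedding_size, activation='relu'))
    return model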

On the face of it, the general sound classification task sounds like a better match for your application, although it would probably make sense to try a range of architectures.

  DAn.




Nikolay Starikov

Oct 12, 2017, 8:44:37 AM
to audioset-users

Thank you very much, Dan, for your answer.

May I ask you two more questions? Does the CNN model based on the simple audio dataset (speech_commands) transform the .wav file into a spectrogram or into MFCCs for further processing? The docstring in input_data.py says that it "Creates a graph that loads a WAVE file, decodes it, scales the volume, shifts it in time, adds in background noise, calculates a spectrogram, and then builds an MFCC fingerprint from that", but in the train.py code I didn't find the transformation of wav files to MFCCs.

For my dataset (sounds, not speech) I would like to convert the .wav files to MFCCs and then to a TFRecord format. Am I right that the simplest way is to use the preprocess_LibriSpeech.py code, which converts audio to TFRecords?

Best regards,
Nikolay

Dan Ellis

Oct 12, 2017, 9:45:05 AM
to Nikolay Starikov, audioset-users
Nikolay - 

I'm afraid I'm not familiar with tensorflow/tensorflow/examples/speech_commands or the deepSpeech/code codebases.  preprocess_Librispeech seems specific to a particular directory structure including text file lists of flac soundfiles which it batches up into TFRecord files.  So it could be a basis for what you want.  input_data actually reads the wav files from within the TF session and calculates the features on the fly, which seems like a neater approach.  But this is based on a very cursory glance.
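If it helps, here is the rough shape of the "wav to MFCCs to TFRecord" route you describe, written against the current tf.signal / tf.train APIs in eager style. It is purely illustrative (not the actual input_data.py or preprocess_LibriSpeech.py code), and the frame sizes, mel-band counts, and feature names are arbitrary choices of mine:

import tensorflow as tf

def wav_to_mfcc(wav_path, sample_rate=16000, num_mel_bins=40, num_mfccs=13):
    """Decode a mono 16 kHz WAV file and return a [frames, num_mfccs] MFCC matrix."""
    waveform, _ = tf.audio.decode_wav(tf.io.read_file(wav_path), desired_channels=1)
    waveform = tf.squeeze(waveform, axis=-1)
    # Short-time Fourier transform -> magnitude spectrogram.
    spectrogram = tf.abs(tf.signal.stft(
        waveform, frame_length=400, frame_step=160, fft_length=512))
    # Warp the linear-frequency bins onto a mel scale, take the log, then the DCT.
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate)
    log_mel = tf.math.log(tf.matmul(spectrogram, mel_matrix) + 1e-6)
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :num_mfccs]

def write_tfrecord(wav_paths_and_labels, output_path):
    """Serialize (wav_path, integer_label) pairs into one TFRecord file."""
    with tf.io.TFRecordWriter(output_path) as writer:
        for wav_path, label in wav_paths_and_labels:
            mfcc = wav_to_mfcc(wav_path)
            example = tf.train.Example(features=tf.train.Features(feature={
                'mfcc': tf.train.Feature(float_list=tf.train.FloatList(
                    value=tf.reshape(mfcc, [-1]).numpy())),
                'num_frames': tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(mfcc.shape[0])])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(label)])),
            }))
            writer.write(example.SerializeToString())

Reading the records back is the mirror image, with tf.data.TFRecordDataset plus tf.io.parse_single_example.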

Good luck,

  DAn.
