Nikolay -

The two models you mention come from different target applications.

Keyword spotting is concerned with speech recognition, where the network is trained to recognize a single spoken keyword or phrase over a wide range of speakers, speaking styles, and background conditions. The CNN in that work is derived from the ones commonly used in speech recognition, with relatively few convolutional layers (e.g. 1 or 2) and relatively large convolution kernels (e.g. 20x8 time-frequency cells).

The VGGish model is aimed at generic sound recognition, and is thus not specialized for speech or phoneme sequences. Like the KWS model, it uses a log-amplitude mel-frequency spectrogram as input, although with greater frequency resolution (64 rather than 32 bands). The VGGish model is inspired by work in image recognition, and uses a larger number (e.g. 4) of narrower (e.g. 3x3) convolutional layers. Unlike the KWS model, it has not been particularly optimized for computational efficiency, although that would be a natural thing to investigate.

On the face of it, the general sound classification task sounds like a better match for your application, although it would probably make sense to try a range of architectures.

DAn.
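For concreteness, here is a minimal sketch of the two layer stacks described above, written with the Keras API. The layer counts, kernel sizes, and input patch shapes follow the rough figures given in this message rather than the exact published configurations, so treat it as an illustration, not the released code.

import tensorflow as tf
from tensorflow.keras import layers, models

def kws_style_cnn(num_classes, frames=98, mel_bands=32):
    # KWS-style net: a couple of large-kernel convolutions over a
    # (time, frequency) log-mel patch, as in small-footprint keyword spotting.
    return models.Sequential([
        layers.Conv2D(64, (20, 8), activation='relu',
                      input_shape=(frames, mel_bands, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (10, 4), activation='relu'),
        layers.Flatten(),
        layers.Dense(num_classes),
    ])

def vggish_style_cnn(num_classes, frames=96, mel_bands=64):
    # VGGish-style net: a deeper stack of small 3x3 convolutions, in the
    # image-recognition style, over a 96-frame x 64-band log-mel patch.
    model = models.Sequential()
    model.add(layers.Conv2D(64, (3, 3), padding='same', activation='relu',
                            input_shape=(frames, mel_bands, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    for filters in (128, 256, 512):
        model.add(layers.Conv2D(filters, (3, 3), padding='same',
                                activation='relu'))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes))
    return model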
On Mon, Oct 2, 2017 at 9:48 AM, Nikolay Starikov <nicho...@gmail.com> wrote:
Dear colleagues,

First of all, thank you very much for your research in audio/sound recognition!

Could you clarify the following question for me? I see two streams for sound recognition on the TensorFlow GitHub:

1. A simple approach using the Speech Commands dataset, based on the "Convolutional Neural Networks for Small-footprint Keyword Spotting" paper (a spectrogram is used as input).
2. A more complex approach using the VGGish CNN model, where a log mel spectrogram is used (without MFCCs, I guess).

I'd like to train and test my own dataset in order to recognize car makes by their engine/motor sounds. Could you please advise me which approach best suits the task? Will Google continue to develop the simple sound models, or concentrate on the large-scale one, or maybe a CRNN?

Best regards,
Nikolay Starikov
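To make the recommendation above concrete: one way to prototype the general sound classification route for engine sounds is to use the released VGGish checkpoint as a fixed feature extractor and train a small classifier on the resulting embeddings. A rough sketch, assuming the vggish_input / vggish_slim / vggish_params helpers and checkpoint from the TensorFlow models repository (models/research/audioset) and the TF 1.x API that the release targets; the wav-file list and labels are hypothetical placeholders.

import numpy as np
import tensorflow as tf

# These modules ship with the VGGish release in the TensorFlow models repo;
# the vggish_model.ckpt checkpoint is downloaded separately.
import vggish_input
import vggish_params
import vggish_slim

def wav_to_embeddings(wav_path, checkpoint_path='vggish_model.ckpt'):
    # wavfile_to_examples returns a [num_patches, 96, 64] array of log mel
    # spectrogram patches, each covering 0.96 s of audio.
    examples = vggish_input.wavfile_to_examples(wav_path)
    with tf.Graph().as_default(), tf.Session() as sess:
        vggish_slim.define_vggish_slim(training=False)
        vggish_slim.load_vggish_slim_checkpoint(sess, checkpoint_path)
        features = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
        embeddings = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
        # One 128-D embedding per 0.96 s patch.
        return sess.run(embeddings, feed_dict={features: examples})

# Averaging the per-patch embeddings per clip and fitting any small
# classifier is enough to get started, e.g. (hypothetical data):
# X = np.vstack([wav_to_embeddings(f).mean(axis=0) for f in engine_wavs])
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression().fit(X, engine_labels)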