How to extract VGGish features from 20ms audio?


Robert Bencze

Mar 3, 2022, 10:07:51 AM
to audioset-users
  1. Could you please point me to documentation on how to use the VGGish models to extract (relevant) audio features from signals shorter than 1 second? I could not find anything detailed on this in the documentation I have found so far.
  2. When I attempt to extract features from an audio signal shorter than 1 second, I get a negative shape error from the frame() function in mel_features.py, at the line "return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides)". The resulting shape, which I print right before the return, is (-10, 96, 64). I used the hop length from this Google Colab notebook, which is based on the AudioSet GitHub repository: https://colab.research.google.com/drive/1E3CaPAqCai9P9QhJ3WYPNCVmrJU4lAhF#scrollTo=PPUqQtVHKggi
  3. When I try different hop lengths, I either get a division by zero error from the frame() function in mel_features.py, at the line "num_frames = 1 + int(np.floor((num_samples - window_length) / hop_length))", or I still get a negative strides error from the line mentioned in point 2.
  4. Is it possible to extract audio features from 20-millisecond signal windows? What parameters should I use to avoid the errors described above?
  5. Is it possible to obtain a 1-dimensional array (i.e. a vector) of audio features at the output of the VGGish model, or can one only extract 2-D arrays (i.e. matrices) from it?
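
For reference, the frame-count arithmetic behind points 2 to 4 can be sketched as below. This is a minimal illustration, not the repository's code: the constants assume the default VGGish settings as I understand them (16 kHz input, 25 ms STFT window, 10 ms STFT hop, 0.96 s examples), and the helper names num_frames, min_samples_for_one_example and pad_to_min_length are hypothetical. It reuses the num_frames formula quoted in point 3 to show that a 20 ms signal yields zero frames, and zero-pads the waveform to the shortest length that gives one full example.

```python
import numpy as np

# Assumed VGGish defaults (not taken from the post itself).
SAMPLE_RATE = 16000
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
EXAMPLE_WINDOW_SECONDS = 0.96


def num_frames(num_samples, window_length, hop_length):
    """Same formula as the line quoted from frame() in point 3."""
    if hop_length <= 0:
        # A hop that rounds down to 0 is what triggers the division by zero.
        raise ValueError("hop_length must be a positive integer")
    return 1 + int(np.floor((num_samples - window_length) / hop_length))


def min_samples_for_one_example(sample_rate=SAMPLE_RATE):
    """Shortest waveform (in samples) that yields one full example."""
    stft_window = int(round(sample_rate * STFT_WINDOW_SECONDS))  # 400 samples
    stft_hop = int(round(sample_rate * STFT_HOP_SECONDS))        # 160 samples
    features_rate = 1.0 / STFT_HOP_SECONDS                       # 100 frames/s
    frames_needed = int(round(EXAMPLE_WINDOW_SECONDS * features_rate))  # 96
    return stft_window + (frames_needed - 1) * stft_hop


def pad_to_min_length(waveform, min_samples):
    """Zero-pad a 1-D waveform at the end so it has at least min_samples."""
    missing = max(0, min_samples - len(waveform))
    return np.pad(waveform, (0, missing))


# A 20 ms signal at 16 kHz has 320 samples, fewer than one 400-sample window.
short = np.zeros(320, dtype=np.float32)
frames_short = num_frames(len(short), 400, 160)    # 0 spectrogram frames

needed = min_samples_for_one_example()             # 15600 samples (0.975 s)
padded = pad_to_min_length(short, needed)
frames_padded = num_frames(len(padded), 400, 160)  # 96 spectrogram frames
```

Padding only makes the framing step produce a valid shape. Whether an embedding computed from 20 ms of signal followed by almost a second of silence is meaningful is a separate question that this sketch does not answer.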