How to extract VGGish features from 20ms audio?

Mar 3, 2022, 10:07:51 AM

to audioset-users

1. Could you please point me to documentation on how to use the VGGish models to extract (relevant) audio features from signals shorter than 1 second? I could not find anything detailed on this in the documentation I have found so far.
2. When I attempt to extract features from an audio signal shorter than 1 second, I get a negative shape error from mel_features.py, in the function frame(), at the line "return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides)". The resulting shape, which I print right before the return, is (-10, 96, 64). I used the hop length from the Google Colab notebook below, which is based on the AudioSet GitHub repository: https://colab.research.google.com/drive/1E3CaPAqCai9P9QhJ3WYPNCVmrJU4lAhF#scrollTo=PPUqQtVHKggi
3. When I try different hop lengths, I either get a division by zero error from mel_features.py, in the function frame(), at the line "num_frames = 1 + int(np.floor((num_samples - window_length) / hop_length))", or I still get a negative strides error from the line mentioned in point 2.
4. Is it possible to extract audio features from 20-millisecond signal windows? What parameters should I use to avoid the errors described above?
5. Is it possible to obtain a 1-dimensional array (i.e. a vector) of audio features at the output of the VGGish model, or can one only extract 2-D arrays (i.e. matrices) from it?
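To make the frame-count arithmetic concrete, here is a minimal sketch of the num_frames formula quoted above, applied to a 20 ms clip. The constants (16 kHz sample rate, 25 ms STFT window, 10 ms STFT hop, 96 frames per example) are assumptions based on the defaults I believe the public VGGish code uses, not values confirmed in this thread. pad_to_min_length is a hypothetical helper, and zero-padding is only one possible workaround, not a documented recommendation; most of the padded input would be silence, so the resulting features may say little about the 20 ms of signal.

```python
import numpy as np

def num_frames(num_samples, window_length, hop_length):
    """Frame-count formula as quoted from mel_features.frame()."""
    return 1 + int(np.floor((num_samples - window_length) / hop_length))

def pad_to_min_length(samples, min_samples):
    """Hypothetical helper: zero-pad a 1-D signal at the end up to min_samples."""
    samples = np.asarray(samples, dtype=np.float64)
    shortfall = max(0, min_samples - samples.shape[0])
    return np.pad(samples, (0, shortfall), mode="constant")

# Assumed defaults (not confirmed in this thread).
SAMPLE_RATE = 16000      # Hz
STFT_WINDOW = 400        # samples, 25 ms at 16 kHz
STFT_HOP = 160           # samples, 10 ms at 16 kHz
EXAMPLE_FRAMES = 96      # spectrogram frames per VGGish example

# A 20 ms clip is shorter than one STFT window, so no frame fits.
clip = np.zeros(int(round(0.020 * SAMPLE_RATE)))
short_count = num_frames(len(clip), STFT_WINDOW, STFT_HOP)

# Smallest input that yields one full example of EXAMPLE_FRAMES frames.
min_samples = STFT_WINDOW + (EXAMPLE_FRAMES - 1) * STFT_HOP

padded = pad_to_min_length(clip, min_samples)
padded_frames = num_frames(len(padded), STFT_WINDOW, STFT_HOP)

# Second framing stage: group spectrogram frames into non-overlapping examples.
examples = num_frames(padded_frames, EXAMPLE_FRAMES, EXAMPLE_FRAMES)

print(short_count, min_samples, padded_frames, examples)
```

Under these assumptions the 20 ms clip produces zero spectrogram frames, and about 0.975 s of audio (15600 samples) is needed before a single 96-frame example exists. Whenever the input to frame() is shorter than window_length, the formula returns a count below 1, which is consistent with the invalid shape reported in point 2.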
