to audioset-users
Could you please point me to documentation on how to use the VGGish models to extract (relevant) audio features from signals that are shorter than 1 second? I could not find anything detailed about this in the documentation I have found so far.
When I attempt to extract features from an audio signal shorter than 1 second, I get a negative shape error from mel_features.py, in the function frame(), at the line "return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides)". The resulting shape, which I print right before the return, is (-10, 96, 64). I used the hop length provided in this Google Colab notebook, which is based on the AudioSet GitHub repository: https://colab.research.google.com/drive/1E3CaPAqCai9P9QhJ3WYPNCVmrJU4lAhF#scrollTo=PPUqQtVHKggi
When I try different hop lengths, I either get a division by zero error from mel_features.py, in the function frame(), at the line "num_frames = 1 + int(np.floor((num_samples - window_length) / hop_length))", or I still get a negative strides error from the as_strided line quoted above.
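To make the two failure modes concrete, here is a minimal sketch that isolates the frame-count formula quoted above. The helper name `count_frames` is hypothetical, `math.floor` stands in for `np.floor` so the snippet has no dependencies, and the inputs (43 feature frames, a hop of 5) are assumed values picked only because they reproduce a leading dimension of -10; they are not read from the Colab notebook.

```python
import math


def count_frames(num_samples, window_length, hop_length):
    """Frame count as computed by the line quoted from frame().

    Hypothetical stand-alone helper: math.floor replaces np.floor,
    which gives the same result for these scalar inputs.
    """
    return 1 + int(math.floor((num_samples - window_length) / hop_length))


# Assumed inputs: 43 feature frames available, a 96-frame example window,
# and a hop of 5 frames. The input is shorter than the window, so the
# numerator is negative and the count drops below zero.
print(count_frames(43, 96, 5))   # -10, matching the shape (-10, 96, 64)

# An input exactly one window long yields a single frame.
print(count_frames(96, 96, 5))   # 1

# A hop that rounds down to 0 frames triggers the division by zero.
try:
    count_frames(43, 96, 0)
except ZeroDivisionError as err:
    print("hop_length of 0:", err)
```

Under these assumptions, both errors come from the same line: the count goes negative whenever the input has fewer frames than window_length, and the division fails whenever the hop length in seconds rounds to zero frames.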
Is it possible to extract audio features from 20-millisecond signal windows? If so, what parameters should I use to avoid the errors described above?
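For scale, the sketch below compares a 20 ms window with the number of samples a single VGGish example would need. The constants are assumptions, namely what I understand to be the defaults in vggish_params.py (16 kHz sample rate, 25 ms STFT window, 10 ms STFT hop, 0.96 s example window); they are not stated in this post and should be checked against the actual file. The helper name is hypothetical.

```python
# Assumed defaults; verify against vggish_params.py before relying on them.
SAMPLE_RATE = 16000             # Hz
STFT_WINDOW_SECONDS = 0.025
STFT_HOP_SECONDS = 0.010
EXAMPLE_WINDOW_SECONDS = 0.96


def min_samples_for_one_example():
    """Smallest waveform length that yields one full example window."""
    stft_window = int(round(SAMPLE_RATE * STFT_WINDOW_SECONDS))
    stft_hop = int(round(SAMPLE_RATE * STFT_HOP_SECONDS))
    # Number of STFT frames that make up one example window.
    frames_needed = int(round(EXAMPLE_WINDOW_SECONDS / STFT_HOP_SECONDS))
    # The first frame needs a full STFT window; each later frame adds one hop.
    return stft_window + (frames_needed - 1) * stft_hop


needed = min_samples_for_one_example()
available = int(round(SAMPLE_RATE * 0.020))  # samples in a 20 ms window

print("samples needed for one example:", needed)
print("samples in a 20 ms window:", available)
```

If these assumed defaults hold, a 20 ms window is shorter than a single STFT window, so the input would have to be padded or the parameters changed before any frame could be produced.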
Is it possible to obtain a 1-dimensional array (i.e. a vector) of audio features at the output of the VGGish model, or can one only extract 2-D arrays (i.e. matrices) from it?