Hi Group,
I would like to ask the community for help. Here is my problem.
Say I have to classify an input audio buffer into one of three categories:
1. Explosion
2. Gunfire
3. Others
All these categories are available in the audio data set. I've fetched the data and preprocessed it into 2-second clips with a 16 kHz sampling frequency. These clips are passed to VGGish to generate embeddings of shape (2, 128).
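To spell out where that shape comes from, here is the arithmetic as I understand it. The 0.96-second frame length with no overlap is my assumption about VGGish's default settings, so please correct me if your setup differs.

```python
SAMPLE_RATE_HZ = 16_000   # from my preprocessing
CLIP_SECONDS = 2.0        # from my preprocessing
FRAME_SECONDS = 0.96      # assumption: VGGish default frame length, no overlap
EMBEDDING_SIZE = 128      # one 128-dimensional embedding per frame

samples_per_clip = int(SAMPLE_RATE_HZ * CLIP_SECONDS)
frames_per_clip = int(CLIP_SECONDS // FRAME_SECONDS)  # whole frames that fit in a clip
embedding_shape = (frames_per_clip, EMBEDDING_SIZE)

print(samples_per_clip, embedding_shape)
```

So each clip gives two consecutive frame embeddings, not a 2D picture.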
I would like to train another lightweight classifier on these embeddings to predict the above-mentioned labels. What should the architecture of this lightweight network be? I'm interested in using MobileNet / SqueezeNet for on-premise prediction, but I'm worried about the input format, as I can't treat a (2, 128) embedding as an image.
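For comparison, the simplest alternative I can think of is a small dense network on the flattened embeddings. Below is a rough NumPy sketch of just the forward pass, to show the shapes involved. The hidden size of 64 is a guess, the weights are random placeholders rather than trained values, and the input batch is random noise standing in for real VGGish embeddings.

```python
import numpy as np

NUM_FRAMES, EMBEDDING_SIZE = 2, 128   # VGGish output shape for one 2-second clip
HIDDEN_UNITS = 64                     # hypothetical size, would need tuning
LABELS = ["Explosion", "Gunfire", "Others"]

rng = np.random.default_rng(0)

def init_params():
    """Random placeholder weights; real values would come from training."""
    in_dim = NUM_FRAMES * EMBEDDING_SIZE
    return {
        "w1": rng.normal(0.0, 0.05, size=(in_dim, HIDDEN_UNITS)),
        "b1": np.zeros(HIDDEN_UNITS),
        "w2": rng.normal(0.0, 0.05, size=(HIDDEN_UNITS, len(LABELS))),
        "b2": np.zeros(len(LABELS)),
    }

def predict_proba(embeddings, params):
    """embeddings: (batch, 2, 128) -> class probabilities: (batch, 3)."""
    x = embeddings.reshape(embeddings.shape[0], -1)             # flatten to (batch, 256)
    hidden = np.maximum(0.0, x @ params["w1"] + params["b1"])   # ReLU hidden layer
    logits = hidden @ params["w2"] + params["b2"]
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)                 # softmax

params = init_params()
batch = rng.normal(size=(4, NUM_FRAMES, EMBEDDING_SIZE))  # stand-in for real embeddings
probs = predict_proba(batch, params)
num_params = sum(p.size for p in params.values())

print(probs.shape, num_params)
```

With these sizes the whole classifier has about 16.6k parameters, which is tiny compared with an image network. I don't know whether something this simple is accurate enough, hence the question.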
Please help!
Warm regards,
Sleeba Paul