Hi, Jonny.
I've done some work that applied convolutional nets to underwater data. To answer your questions first,
> Would VGGish be a suitable model for feature extraction?
I'm confident the architecture (defined in vggish_slim.py) is suitable, though it's an older one, and I've had better luck with EfficientNetB0. If we consider the checkpoint as well as the architecture, I become a little less confident because of the sample rate mismatch (16kHz vs. your 1-2kHz) and the domain mismatch (YouTube vs. hydrophone).
> Is it easy to remove the steps of forming Mel spectrograms (ordinary linear spectrograms would be more suitable) and reformatting audio to the YouTube-8M format?
The code is structured in a way that allows this.
vggish_inference_demo.py calls wavfile_to_examples. Reading the implementation of wavfile_to_examples, I see that it's essentially computing a log mel spectrogram, which you could replace or modify with settings you prefer. (For me this has sometimes included both a linear frequency scale and a longer STFT window duration.) Of course, once the spectrogram is changed, it isn't valid to use the checkpoint trained on the old settings.
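In case it's useful, here's a minimal sketch of what a replacement with a linear frequency scale could look like. The function name and every parameter value are placeholders of mine, not anything from the repo, so treat them as starting points to tune against your 1-2kHz recordings.

```python
# Minimal sketch of a linear-frequency alternative to wavfile_to_examples.
# Window/hop/example sizes below are placeholders, not the VGGish defaults.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft


def wavfile_to_linear_examples(path, window_seconds=0.5, hop_seconds=0.25,
                               example_frames=96):
    sample_rate, samples = wavfile.read(path)
    if samples.dtype == np.int16:
        samples = samples / 32768.0  # convert to [-1.0, 1.0] floats
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # mix down to mono

    nperseg = int(round(window_seconds * sample_rate))
    noverlap = nperseg - int(round(hop_seconds * sample_rate))
    _, _, spec = stft(samples, fs=sample_rate, nperseg=nperseg,
                      noverlap=noverlap)
    log_spec = np.log(np.abs(spec) + 1e-6).T  # shape: [frames, freq_bins]

    # Frame the spectrogram into fixed-size, non-overlapping examples,
    # analogous to the [num_frames, num_bands] patches VGGish expects.
    num_examples = log_spec.shape[0] // example_frames
    return log_spec[:num_examples * example_frames].reshape(
        num_examples, example_frames, -1)
```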
Now, some advice and pointers not exactly about either of your questions:
Besides the sample rate mismatch, YouTube vs. hydrophone soundscapes is a domain mismatch. We have released a humpback whale detection model (10kHz) and a newer multi-species whale detection model (24kHz). Being trained on underwater data, these have less of a domain mismatch but still aren't at your sample rate. They have a feature-extraction signature called features, detailed in the linked docs pages.
For any of the three models above, it's easy enough to try them out as feature extractors by using the released checkpoints / SavedModels and upsampling your data. My guess is that the multi-species model is the most likely to do something useful, but I can't be sure.
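As a rough illustration of the "upsample and reuse" route, here's a sketch against the multi-species model. Only the signature name (features) comes from the docs; the SavedModel path, the input layout, and the argument name are assumptions you'd need to check against the linked docs page.

```python
# Rough sketch: upsample 2kHz hydrophone audio to the model's rate and call
# its feature-extraction signature. Path, input shape, and the "waveform"
# argument name are assumptions; verify them against the model docs.
import numpy as np
import tensorflow as tf
from scipy.signal import resample_poly

SOURCE_RATE = 2_000    # your hydrophone sample rate
MODEL_RATE = 24_000    # multi-species model's expected rate

waveform_2k = np.random.randn(SOURCE_RATE * 60).astype(np.float32)  # stand-in for your audio
waveform_24k = resample_poly(waveform_2k, up=MODEL_RATE, down=SOURCE_RATE)

model = tf.saved_model.load('/path/to/downloaded/saved_model')  # placeholder path
features_fn = model.signatures['features']
# Batch/channel dims here are a guess; check the docs page for the real spec.
features = features_fn(
    waveform=tf.constant(waveform_24k[np.newaxis, :, np.newaxis],
                         dtype=tf.float32))
```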
If you end up having to retrain, and if you don't have class labels for supervised learning, you could consider a triplet loss like the one described in "Unsupervised Learning of Semantic Audio Representations." We've seen that work decently for dolphin call type classification where the audio was from the same operator, equipment, and time of year. It worked less well for the dataset referenced in the humpback model docs: there, training with temporal-proximity triplets ended up with clusters organized mainly around deployment location and year. I don't know of public example code for these triplet losses.
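That said, the loss itself is simple to write down. Here's a generic sketch of a triplet loss in TensorFlow, not the exact recipe from the paper; the squared-distance choice and the margin value are arbitrary, and the anchor/positive pairs would come from clips close together in time, with negatives drawn from farther away.

```python
# Generic triplet loss sketch for temporal-proximity triplets.
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor, positive, negative: [batch, embedding_dim] tensors."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Push the positive closer to the anchor than the negative, by at least
    # the margin; zero loss once the triplet is already well separated.
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```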
For EfficientNetB0, there is a Keras implementation that I used in this (messy, unsupported) example. I can see from a web search that there are implementations for non-TensorFlow frameworks too.
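If you go the Keras route, the feature extractor itself can be as small as the sketch below. The input patch shape and the choice to train from scratch (weights=None) are my assumptions for low-sample-rate hydrophone spectrograms, not anything from that example.

```python
# Sketch: Keras EfficientNetB0 as a spectrogram feature extractor.
import tensorflow as tf

backbone = tf.keras.applications.EfficientNetB0(
    include_top=False,          # drop the ImageNet classification head
    weights=None,               # train from scratch on your spectrograms
    input_shape=(128, 128, 1),  # [time_frames, freq_bins, 1] patches (assumed)
    pooling='avg')              # global average pooling -> one vector per patch

spectrograms = tf.random.normal([8, 128, 128, 1])  # stand-in batch
embeddings = backbone(spectrograms)                # shape [8, 1280]
```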
I hope some of this helps.