Hi, Jonny.
I've done some work that applied convolutional nets to underwater data. To answer your questions first,
> Would VGGish be a suitable model for feature extraction?
I'm confident the architecture (defined in vggish_slim.py) is suitable, though it's an older one, and I've had better luck with EfficientNetB0. If we consider the checkpoint as well as the architecture, I become a little less confident because of the sample rate mismatch (16kHz vs. your 1-2kHz) and the domain mismatch (YouTube vs. hydrophone).
> Is it easy to remove the steps of forming Mel spectrograms (ordinary linear spectrograms would be more suitable) and reformatting audio to the YouTube-8M format?
The code is structured in a way that allows this.
vggish_inference_demo.py calls wavfile_to_examples. Reading the implementation of wavfile_to_examples, I see that it's essentially computing a log mel spectrogram, which you could replace or modify with settings you prefer. (For me this has sometimes included both a linear frequency scale and a longer STFT window duration.) Of course, once the spectrogram is changed, it isn't valid to use the checkpoint trained on the old settings.
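In case it's useful, here's a minimal sketch of what a replacement with a linear frequency scale could look like. The function name and every parameter value are placeholders of mine, not anything from the repo, so treat them as starting points to tune against your 1-2kHz recordings.

```python
# Minimal sketch of a linear-frequency alternative to wavfile_to_examples.
# Window/hop/example sizes below are placeholders, not the VGGish defaults.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft


def wavfile_to_linear_examples(path, window_seconds=0.5, hop_seconds=0.25,
                               example_frames=96):
    sample_rate, samples = wavfile.read(path)
    if samples.dtype == np.int16:
        samples = samples / 32768.0  # convert to [-1.0, 1.0] floats
    if samples.ndim > 1:
        samples = samples.mean(axis=1)  # mix down to mono

    nperseg = int(round(window_seconds * sample_rate))
    noverlap = nperseg - int(round(hop_seconds * sample_rate))
    _, _, spec = stft(samples, fs=sample_rate, nperseg=nperseg,
                      noverlap=noverlap)
    log_spec = np.log(np.abs(spec) + 1e-6).T  # shape: [frames, freq_bins]

    # Frame the spectrogram into fixed-size, non-overlapping examples,
    # analogous to the [num_frames, num_bands] patches VGGish expects.
    num_examples = log_spec.shape[0] // example_frames
    return log_spec[:num_examples * example_frames].reshape(
        num_examples, example_frames, -1)
```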
Now, some advice and pointers not exactly about either of your questions:
Besides the sample rate mismatch, YouTube vs. hydrophone soundscapes is a domain mismatch. We have released a humpback whale detection model (10kHz) and a newer multi-species whale detection model (24kHz). Being trained on underwater data, these have less of a domain mismatch but still aren't at your sample rate. They have a feature-extraction signature called features, detailed in the linked docs pages.
For any of the three models above, it's easy enough to try them out as feature extractors by using the released checkpoints / SavedModels and upsampling your data. My guess is that the multi-species model is the most likely to do something useful, but I can't be sure.
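As a rough illustration of the "upsample and reuse" route, here's a sketch against the multi-species model. Only the signature name (features) comes from the docs; the SavedModel path, the input layout, and the argument name are assumptions you'd need to check against the linked docs page.

```python
# Rough sketch: upsample 2kHz hydrophone audio to the model's rate and call
# its feature-extraction signature. Path, input shape, and the "waveform"
# argument name are assumptions; verify them against the model docs.
import numpy as np
import tensorflow as tf
from scipy.signal import resample_poly

SOURCE_RATE = 2_000    # your hydrophone sample rate
MODEL_RATE = 24_000    # multi-species model's expected rate

waveform_2k = np.random.randn(SOURCE_RATE * 60).astype(np.float32)  # stand-in for your audio
waveform_24k = resample_poly(waveform_2k, up=MODEL_RATE, down=SOURCE_RATE)

model = tf.saved_model.load('/path/to/downloaded/saved_model')  # placeholder path
features_fn = model.signatures['features']
# Batch/channel dims here are a guess; check the docs page for the real spec.
features = features_fn(
    waveform=tf.constant(waveform_24k[np.newaxis, :, np.newaxis],
                         dtype=tf.float32))
```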
If you end up having to retrain, and if you don't have class labels for supervised learning, you could consider a triplet loss like the one described in "Unsupervised Learning of Semantic Audio Representations." We've seen that work decently for dolphin call type classification where the audio was from the same operator, equipment, and time of year. It worked less well for the dataset referenced in the humpback model docs: there, training with temporal-proximity triplets ended up with clusters organized mainly around deployment location and year. I don't know of public example code for these triplet losses.
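That said, the loss itself is simple to write down. Here's a generic sketch of a triplet loss in TensorFlow, not the exact recipe from the paper; the squared-distance choice and the margin value are arbitrary, and the anchor/positive pairs would come from clips close together in time, with negatives drawn from farther away.

```python
# Generic triplet loss sketch for temporal-proximity triplets.
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor, positive, negative: [batch, embedding_dim] tensors."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Push the positive closer to the anchor than the negative, by at least
    # the margin; zero loss once the triplet is already well separated.
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```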
For EfficientNetB0, there is a Keras implementation that I used in this (messy, unsupported) example. I can see from a web search that there are implementations for non-TensorFlow frameworks too.
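If you go the Keras route, the feature extractor itself can be as small as the sketch below. The input patch shape and the choice to train from scratch (weights=None) are my assumptions for low-sample-rate hydrophone spectrograms, not anything from that example.

```python
# Sketch: Keras EfficientNetB0 as a spectrogram feature extractor.
import tensorflow as tf

backbone = tf.keras.applications.EfficientNetB0(
    include_top=False,          # drop the ImageNet classification head
    weights=None,               # train from scratch on your spectrograms
    input_shape=(128, 128, 1),  # [time_frames, freq_bins, 1] patches (assumed)
    pooling='avg')              # global average pooling -> one vector per patch

spectrograms = tf.random.normal([8, 128, 128, 1])  # stand-in batch
embeddings = backbone(spectrograms)                # shape [8, 1280]
```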
I hope some of this helps.