YAMNet 1024-D embedding with audio longer than 1 s


Michele Esposito

unread,
May 7, 2020, 12:37:11 PM
to audioset-users

Hello,
I am trying to extract the 1024-D embedding of sounds that are 5 s long, sampled at 44.1 kHz as 16-bit .wav files. There are 1500 of them, and this is the code:

import numpy as np
import resampy
import soundfile as sf
from keras.models import Model

# yamnet, params, and graph come from the YAMNet repo setup (yamnet.py,
# params.py); embedding is preallocated as np.zeros((len(files), 1024)).
extractor = Model(inputs=yamnet.input,
                  outputs=yamnet.get_layer('global_average_pooling2d').output)

i = 0
for f in files:
    wav_data, sr = sf.read(path + f, dtype=np.int16)
    assert wav_data.dtype == np.int16, 'Bad sample type: %r' % wav_data.dtype
    wave = wav_data / 32768.0  # Convert to [-1.0, +1.0]

    # Convert to mono and the sample rate expected by YAMNet.
    if len(wave.shape) > 1:
        wave = np.mean(wave, axis=1)
    if sr != params.SAMPLE_RATE:
        wave = resampy.resample(wave, sr, params.SAMPLE_RATE)

    with graph.as_default():
        features = extractor.predict(np.reshape(wave, [1, -1]), steps=1)
    embedding[i, :] = np.squeeze(features)  # <-- the error occurs here
    i += 1
and I got this error:

"could not broadcast input array from shape (41,1024) into shape (1,1024)"

When I inspected the features, instead of a 1x1024 embedding I found a 41x1024 array.

What do those 41 rows mean? Why did I get 41 of them?

Thank you

Dan Ellis

unread,
May 7, 2020, 1:43:31 PM
to Michele Esposito, audioset-users
I think the issue is that YAMNet generates one frame of classifications (or embeddings) for each frame extracted with a 100 ms hop between frames. So, with 5 s of input, you can fit 41 successive (975 ms) frames into the classifier, and you get the embeddings from all 41 of them as output.

If you want a summary of the entire clip, you can do something simple like averaging across the frame dimension.  I'm not sure how averaging embeddings affects the results (it's more natural in the class scores domain) but it would probably be OK.
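For example, that averaging is a one-liner in NumPy (dummy array standing in for the real per-frame YAMNet output):

```python
import numpy as np

# Hypothetical per-frame embeddings from YAMNet: 41 frames x 1024 dims.
frame_embeddings = np.random.rand(41, 1024)

# Collapse the frame dimension with a simple mean to get one
# clip-level 1024-d embedding.
clip_embedding = frame_embeddings.mean(axis=0)  # shape (1024,)
```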

Does that make sense?

  DAn.

--
You received this message because you are subscribed to the Google Groups "audioset-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to audioset-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/audioset-users/9cd82386-fab8-4851-b2b1-a5b16df9244b%40googlegroups.com.

Manoj Plakal

unread,
May 7, 2020, 2:01:27 PM
to Dan Ellis, Michele Esposito, audioset-users

If you're combining embeddings from multiple frames, then you should read the output of the layer preceding the global average pool (i.e., read the output of the last separable convolution layer), and then do the global average pool yourself. The dimensionality of this pre-pooled embedding will probably be larger than 1024.
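A sketch of doing the pooling yourself, assuming (as for YAMNet's 0.96 s patches) that the pre-pool layer emits a 3x2x1024 activation per frame (dummy data here):

```python
import numpy as np

# Hypothetical pre-pool activations: one 3x2x1024 map per frame.
prepool = np.random.rand(41, 3, 2, 1024)

# Do the global average pool yourself: average over the 3x2 spatial
# cells of each frame, then over frames for a clip-level embedding.
per_frame = prepool.mean(axis=(1, 2))    # shape (41, 1024)
clip_embedding = per_frame.mean(axis=0)  # shape (1024,)
```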


Michele Esposito

unread,
May 7, 2020, 4:15:37 PM
to audioset-users

@Dan:
I don't think averaging in a semantic space is a good idea, no? I will try this method though, and I'll let you know.
@Manoj:
So do you mean the 3x2x1024 layer or the one before it? How can I do the global averaging myself in order to combine all 41 frames into one classification? Isn't that the same as averaging the 41x1024 embedding along the 41-frame dimension, as Dan suggests?

Thank you for the help

Manoj Plakal

unread,
May 7, 2020, 4:28:06 PM
to Michele Esposito, audioset-users

My suggestion was to extract the layer with activation shape [3, 2, 1024] that feeds into the global average pool, flatten it and treat it like a 6144-d embedding of each frame of audio. Then aggregate that embedding however you like across frames of a clip. It could be a simple averaging across time and frequency to get 1024-d (which would be essentially the same as just averaging the average-pooled output) or you could do something more complicated (some kind of learned dimensionality reducer from 6144->1024 that works better for your task).
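A sketch of the flatten-then-aggregate option (dummy activations; note that plain averaging of the 6144-d vectors reduces exactly to the pooled 1024-d mean):

```python
import numpy as np

prepool = np.random.rand(41, 3, 2, 1024)  # dummy [frames, 3, 2, 1024] activations

# Flatten each frame's 3x2x1024 map into a 6144-d vector, then average
# across the clip's frames.
frames_6144 = prepool.reshape(len(prepool), -1)  # (41, 6144)
clip_6144 = frames_6144.mean(axis=0)             # (6144,)

# With simple averaging, pooling afterwards recovers the 1024-d case:
clip_1024 = clip_6144.reshape(3, 2, 1024).mean(axis=(0, 1))
```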

In general, I think that if you are working with embeddings instead of scores, and you need to aggregate embeddings for some larger input, then you probably want to extract the model output prior to any pooling or aggregation so that you have control over how aggregation happens.

Can you describe what you are doing with these clip-level embeddings?




Michele Esposito

unread,
May 7, 2020, 5:09:18 PM
to audioset-users
Hi Manoj,
I will try your solution and I'll let you know...
I am trying to develop a network that can classify sounds through a hierarchy, for now using only two levels of semantic labels. So I'm trying to do Sound Event Classification using an ontological layer...
I would like to classify each 5-second sound with one label, and further try to also obtain the sub-label of the sound that is related to the main class.
I don't know if I've explained the situation well.
 
Thank you for the support



Manoj Plakal

unread,
May 7, 2020, 5:30:54 PM
to Michele Esposito, audioset-users

And are the classes/sub-classes predicted by your network not already covered by the 521-class vocabulary of YAMNet?

If we already have the classes that you want, then you could just average the scores predicted by the model over all frames in a clip instead of dealing with the embeddings and aggregation.
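A sketch of that score averaging (dummy scores standing in for YAMNet's per-frame output over its 521 classes):

```python
import numpy as np

# Hypothetical per-frame scores: 41 frames x 521 YAMNet classes.
scores = np.random.rand(41, 521)

# Average scores over frames, then take the top class for the clip.
clip_scores = scores.mean(axis=0)  # (521,)
top_class = int(clip_scores.argmax())
```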




Michele Esposito

unread,
May 7, 2020, 6:04:49 PM
to audioset-users
Hi,
the point is that I took the dataset from the MSOS challenge, which also provides a "vocabulary" of 5 main classes and 97 subclasses. I would like to adapt that to the sounds of the challenge using YAMNet: extract the 1024-d embedding of each sound and then, using an ontological layer, try to predict the label hierarchy. So I would like to retrain the network while keeping the YAMNet weights fixed, i.e. use YAMNet as part of a larger model, in the same way as VGGish.



George Boateng

unread,
May 7, 2020, 7:35:19 PM
to Michele Esposito, audioset-users
Hi Michele,

I'm dealing with a similar issue with the 41 feature vectors for a 5-sec clip. It's because the constant PATCH_HOP_SECONDS is set to 0.1 sec. If you want fewer vectors, you can change that constant to be larger (I'm using 0.5 sec for example).

In terms of how to handle several vectors per clip, I'm using majority voting of each vector's classification to assign a class to the whole clip.
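A sketch of that majority vote (dummy scores; the hop change itself would be `params.PATCH_HOP_SECONDS = 0.5` before building the model):

```python
import numpy as np

# Hypothetical per-frame scores; with a 0.5 s hop a 5 s clip yields
# far fewer frames than the 41 you get at 0.1 s.
scores = np.random.rand(9, 521)

# Majority vote: each frame votes with its top-scoring class.
frame_classes = scores.argmax(axis=1)
values, counts = np.unique(frame_classes, return_counts=True)
clip_class = int(values[counts.argmax()])
```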

For fine-tuning YAMNet, I think someone asked a similar question and the response was that it's not possible, and that the model is provided as-is for either feature extraction or classification. Someone can correct me if I'm wrong though.

Best,
George


Manoj Plakal

unread,
May 7, 2020, 7:47:02 PM
to George Boateng, Michele Esposito, audioset-users

Fine-tuning the released YAMNet model isn't impossible, but you will need to do a fair bit of work to train the core model with framed examples instead of entire clips, as described in https://github.com/tensorflow/models/issues/8425



Manoj Plakal

unread,
May 8, 2020, 12:51:24 PM
to Michele Esposito, audioset-users

For a clip-level embedding, I would start with the simple averaging over frames that DAn suggested (which is equivalent to what I suggested about taking the previous layer's output and aggregating yourself, if you're just averaging for aggregation).

VGGish and YAMNet have the same issue here: we provide embeddings for individual frames and you have to aggregate somehow to get a clip-level embedding. I would start with simple aggregation such as averaging. 




On Fri, May 8, 2020 at 6:31 AM Michele Esposito <michele.es...@gmail.com> wrote:
@George
So you suggest working with the classification scores and extracting the labels from there?
@Manoj
I would like to use the 1024-feature embedding as input to a network that can replicate the ontological layer in order to predict the hierarchy, as described in this paper :
Is it possible with YAMNet? Does it make sense to take the embedding output of YAMNet and do this?
Or is it easier with VGGish?

Have a nice day.

