yamnet hop parameters and output size


Roberta Rocca

Mar 4, 2020, 8:02:20 PM
to audioset-users
Hi there!

Quick question to make sure I am correctly interpreting the meaning of yamnet parameters and how they relate to the size of the output matrix.

Is it correct that YAMNet will yield, for each label, one probability value for each "chunk" of the waveform, where the "onsets" of chunks are equally spaced by PATCH_HOP_SECONDS?

In other words, if my audio file lasts 3s, PATCH_HOP_SECONDS is 0.5s and PATCH_WINDOW_SECONDS is 1s, I should get label probabilities for:
chunk 1: onset = 0.0s, duration = 0.5s;
chunk 2: onset = 0.5s, duration = 0.5s;
chunk 3: onset = 1.0s, duration = 0.5s;
chunk 4: onset = 1.5s, duration = 0.5s;
chunk 5: onset = 2.0s, duration = 0.5s

If that is NOT correct, could you help me understand how this works?

Thank you in advance!
Roberta

Manoj Plakal

Mar 4, 2020, 8:27:09 PM
to Roberta Rocca, audioset-users

- The model produces classifier scores, not probabilities. We have not calibrated the scores. Ideally, you would do some fine-tuning and/or calibration on data that is in your domain of interest to get outputs that are interpretable as probabilities for the kind of data that you're interested in.

- The classifier expects a window of 0.96s. With a different window length you will most likely see an error, I think, but you could try.

- If window is 0.6s and hop is 0.5s, each example will be 0.6s long and will be generated every 0.5 s, so you get classifier scores for
  example 1: onset = 0.0s, duration = 0.96s
  example 2: onset = 0.5s, duration = 0.96s
  ...
  etc


  


--
You received this message because you are subscribed to the Google Groups "audioset-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to audioset-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/audioset-users/3457f714-c0c5-433a-ac13-d60f8861b133%40googlegroups.com.

Manoj Plakal

Mar 4, 2020, 8:28:13 PM
to Roberta Rocca, audioset-users

I meant to say that you will most likely see an error if you try a window of length other than 0.96s, although I guess the pooling inside the model might hide the issue, so you can try and see. Ideally, you'd just use a window of 0.96s to avoid any issues.

Manoj Plakal

Mar 4, 2020, 8:29:01 PM
to Roberta Rocca, audioset-users

Oops, I meant 0.96s instead of 0.6s, of course.



Roberta Rocca

Mar 4, 2020, 8:32:01 PM
to audioset-users
One more question then. I was a bit confused by the plot in the example notebook. There, some padding is added to the last plot to align it with the spectrogram, which makes it seem like the first of the predictions does not refer to onset = 0.0s, but later...
R

Manoj Plakal

Mar 4, 2020, 8:43:39 PM
to Roberta Rocca, Dan Ellis, audioset-users

The first prediction does correspond to onset = 0, but it also corresponds to an entire window of input, so I don't think you can plot it at x=0. You need to offset it to represent the fact that the model is running behind the waveform by a certain amount.

+DAn for the offset calculation and why it's more complicated than just offsetting it by the window length in seconds :)




Roberta Rocca

Mar 4, 2020, 8:46:39 PM
to audioset-users
So if one had to come up with a way of mapping back each output to an onset and a duration in the original audio file, what is the equation to compute that from the model parameters?
R

Manoj Plakal

Mar 4, 2020, 8:58:37 PM
to Roberta Rocca, Dan Ellis, audioset-users

Output #N (N >= 0) corresponds to an onset of patch-hop * N seconds and a window of patch-window seconds, as I listed in my example.

If you want to be completely accurate, the window of audio used will actually be patch-window + (stft-window - stft-hop) seconds, but you could ignore that if you're not interested in a difference of 0.015s.
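As a rough sketch of the mapping described above, the helper below returns the start and end time of the audio behind output N. The name `output_span` is hypothetical, and the STFT window and hop of 0.025 s and 0.010 s are assumed values chosen to match the 0.015 s difference mentioned; check them against the parameters you actually run with.

```python
def output_span(n, patch_hop, patch_window=0.96,
                stft_window=0.025, stft_hop=0.010):
    """Return (onset, end) in seconds for output index n (n >= 0).

    Hypothetical helper, not part of the YAMNet code. The STFT
    defaults are assumptions consistent with the 0.015 s figure.
    """
    onset = n * patch_hop
    # Audio actually consumed is slightly longer than the patch window.
    duration = patch_window + (stft_window - stft_hop)
    return onset, onset + duration


if __name__ == "__main__":
    # Example with a 0.5 s hop, as in the question at the top of the thread.
    for n in range(3):
        onset, end = output_span(n, patch_hop=0.5)
        print(f"output {n}: {onset:.3f}s to {end:.3f}s")
```

If the 0.015 s correction does not matter for your use, passing `stft_window=stft_hop` reduces this to the simpler onset and patch-window rule.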




Dan Ellis

Mar 5, 2020, 8:29:55 AM
to Manoj Plakal, Roberta Rocca, audioset-users
Thanks, Manoj, for your excellent explanations.

Regarding "what is the correct time at which to display the classification results of output frame N?":
As Manoj explains, the N'th output frame corresponds to a classification of the 0.96 seconds spanning from N*(PATCH_HOP_SECONDS) to N*(PATCH_HOP_SECONDS) + PATCH_WINDOW_SECONDS (=0.96). Because of the way the classifier is trained, this tends to result in some "spreading": for instance, 50 or 100ms of speech at either edge of the patch will likely result in a high score for Speech, "dilating" the marked region of speech activity by ~0.48 s at each end of a long speech burst.

The best single time at which to display the scores is the middle of the patch, which is why the displays in the notebook are shifted by 0.48 s (=PATCH_WINDOW_SECONDS/2), and why the last frame is still that distance away from the end of the waveform.
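A minimal sketch of that display convention, assuming each score is plotted at the centre of its patch. `display_time` is a hypothetical helper for illustration, not something taken from the notebook.

```python
def display_time(n, patch_hop, patch_window=0.96):
    """Time in seconds at which to plot the scores of output index n.

    Hypothetical helper: the patch starts at n * patch_hop and the
    scores are shown at its midpoint, i.e. shifted by patch_window / 2.
    """
    return n * patch_hop + patch_window / 2.0


if __name__ == "__main__":
    # With a 0.48 s hop (an assumed value), the first score lands at 0.48 s.
    times = [display_time(n, patch_hop=0.48) for n in range(4)]
    print([round(t, 2) for t in times])
```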

PATCH_WINDOW_SECONDS is a property of the classifier; it can't be changed without training a new model with a differently-sized input layer. If you use a value other than 0.96 in the example code, it will likely just mess up the prediction of the number of frames and their correct alignment; it won't change the classifier scores.

  DAn.

Roberta Rocca

Jun 10, 2020, 5:41:19 AM
to audioset-users
Hi folks, sorry for yet another question on this, but we are trying to fix a bug and I want to make sure it's not due to me misunderstanding this.
So, if I run yamnet with PATCH_HOP_SECONDS = 0.1 and PATCH_WINDOW_SECONDS = 0.96, my first output value will correspond roughly to the part of the input audio going from 0s to ~0.975s, the second to the part from 0.1s to 1.075s, the third to the part from 0.2s to 1.175s, etc.
So output values refer to overlapping chunks of the audio. Correct?
Thanks a lot for your help!
Roberta

Dan Ellis

Jun 10, 2020, 9:07:22 AM
to Roberta Rocca, audioset-users
Yes, that's correct.

  DAn.
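
For debugging alignment, the overlap confirmed above can also be read in the other direction: given a time in the audio, which outputs saw it? The sketch below assumes the 0.1 s hop and ~0.975 s span from the question; `outputs_covering` is a hypothetical name, and the number of outputs must come from your own model run.

```python
def outputs_covering(t, num_outputs, patch_hop=0.1, span=0.975):
    """Indices n whose input span [n*hop, n*hop + span) contains time t.

    Hypothetical helper. `span` is the patch window plus the small
    STFT correction discussed earlier in the thread (0.96 + 0.015).
    """
    return [
        n for n in range(num_outputs)
        if n * patch_hop <= t < n * patch_hop + span
    ]


if __name__ == "__main__":
    # Which of the first 20 outputs include the audio at t = 1.05 s?
    print(outputs_covering(1.05, num_outputs=20))
```

With a hop this much smaller than the window, any given instant contributes to roughly ten consecutive outputs, so a short event will show up in a run of adjacent scores rather than a single one.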
