Reconstructing the Results released using VGGish as a feature extractor.


Ravi Jain

Dec 7, 2017, 11:44:51 PM
to audioset-users
Hello,
This is in regard to the results shown in Hershey, S. et al., "CNN Architectures for Large-Scale Audio Classification", ICASSP 2017, Section 4.4 ("AED with the Audio Set Dataset").

'The first model uses 64×20 log-mel patches and the second uses the output of the penultimate “embedding” layer of our best ResNet model as inputs. The log-mel baseline achieves a balanced mAP of 0.137 and AUC of 0.904 (equivalent to d-prime of 1.846). The model trained on embeddings achieves mAP / AUC / d-prime of 0.314 / 0.959 / 2.452.'
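(Side note: the d-prime values quoted alongside AUC appear to follow the standard equal-variance Gaussian conversion, d' = sqrt(2) * Phi^-1(AUC). A quick scipy sketch of my own, not from the paper's code:)

```python
from math import sqrt
from scipy.stats import norm

def d_prime(auc):
    """Convert AUC to d-prime under the equal-variance Gaussian assumption."""
    return sqrt(2.0) * norm.ppf(auc)

print(d_prime(0.904))  # ~1.85, matching the quoted 1.846
print(d_prime(0.959))  # ~2.46, close to the quoted 2.452
```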

Also, @DAn mentioned in one of the threads:

'Our preferred accuracy metric is "Top N_L correct", which, for each eval segment, looks at the N_L highest-scoring classifier outputs, where N_L is the number of ground-truth labels associated with the eval segment (which varies from segment to segment).  Then the accuracy for that segment is the proportion of those top N_L classifications that correspond to the ground truth, so it always lies between 0 and 1.  The overall accuracy is the average of the per-segment accuracies over all the evaluation segments.  This is the same measure (I believe) as what is called "Precision at Equal Recall Rate" (PERR) in the YouTube-8M paper.'
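As I understand that description, a minimal NumPy sketch of "Top N_L correct" / PERR would be (my own hypothetical helper, not the official implementation):

```python
import numpy as np

def perr(scores, labels):
    """'Top N_L correct' / PERR: per segment, take the N_L highest-scoring
    classes (N_L = number of ground-truth labels for that segment), score
    the fraction that are correct, then average over segments."""
    per_segment = []
    for s, y in zip(scores, labels):
        n_l = int(y.sum())
        if n_l == 0:
            continue                       # segments with no labels are skipped
        top = np.argsort(s)[-n_l:]         # indices of the N_L highest scores
        per_segment.append(y[top].sum() / n_l)
    return float(np.mean(per_segment))

# Toy example: 2 segments, 4 classes, 2 true labels each.
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.3, 0.7, 0.2, 0.6]])
labels = np.array([[1, 0, 1, 0],   # top-2 = {0, 2}: both correct -> 1.0
                   [1, 1, 0, 0]])  # top-2 = {1, 3}: one correct  -> 0.5
print(perr(scores, labels))        # 0.75
```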


Query:
Now I am training a fully connected classifier on the released VGGish features, with a final classification layer that uses sigmoid activations and multi-hot vector labels (size: 527), and I am having trouble understanding how the test results presented in the paper were computed.
I understand that AudioSet is weakly labelled, i.e., an audio segment labelled 'Human Speech' may not also be labelled 'Human Sound' or 'Human Voice' (which it is, according to the ontology). That is not an issue during training, but during testing (assuming our model is trained ideally) the model will likely predict 'Human Sound' and 'Human Voice' as well, possibly with higher confidence. So following "Top N_L correct" accuracy might penalize predictions that are correct under the ontology.
E.g., on an audio segment labelled 'Human Speech', our model predicts {Human Speech: 0.8, Human Voice: 0.85, Human Sound: 0.9}, thus classifying it as 'Human Sound' (using Top N_L correct) and driving the accuracy on that eval segment down.
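Concretely, under Top N_L correct that segment would score zero (a tiny NumPy illustration of my example, nothing official):

```python
import numpy as np

# Classes: [Human Speech, Human Voice, Human Sound].
# Ground truth lists only Human Speech, so N_L = 1.
scores = np.array([0.8, 0.85, 0.9])
labels = np.array([1, 0, 0])

n_l = int(labels.sum())            # N_L = 1
top = np.argsort(scores)[-n_l:]    # top-1 prediction: index 2, Human Sound
segment_acc = labels[top].sum() / n_l
print(segment_acc)                 # 0.0: the ancestor "wins" the top slot
```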

How is this issue taken care of?

TL;DR
How is testing done with a weakly labelled dataset when the labels aren't independent?


Regards,
Ravi Jain

Manoj Plakal

Dec 8, 2017, 12:18:50 PM
to Ravi Jain, audioset-users, Dan Ellis

If I'm understanding correctly, you're asking if the existence of a ground truth label (e.g., Human Speech) also implies the existence of its ancestors in the ontology (e.g., Human Voice, Human Sound) as additional ground truth labels.

For evaluation purposes, we do not use any such implied ground truth labels. We use only the labels explicitly listed in the dataset. 

There are some tricky issues with always assuming that the ancestors of a particular label also apply, because in some cases the parent node ends up as the ground-truth label precisely when the child node does not apply (e.g., Human Voice sounds that are not Human Speech). I've cced DAn in case he wants to elaborate on this.





--
You received this message because you are subscribed to the Google Groups "audioset-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to audioset-users+unsubscribe@googlegroups.com.
To post to this group, send email to audioset-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/audioset-users/414a0ba4-f64b-4653-a5f9-d98b57b5021d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ravi Jain

Dec 8, 2017, 9:22:35 PM
to audioset-users


On Friday, 8 December 2017 22:48:50 UTC+5:30, Manoj Plakal wrote:

If I'm understanding correctly, you're asking if the existence of a ground truth label (e.g., Human Speech) also implies the existence of its ancestors in the ontology (e.g., Human Voice, Human Sound) as additional ground truth labels.
Yes. Also, if our model is trained well, it will predict the parent labels along with the 'Human Speech' label.
 
For evaluation purposes, we do not use any such implied ground truth labels. We use only the labels explicitly listed in the dataset. 
I interpreted the same from @DAn's reply. But wouldn't it drive my accuracy down if my model also predicts 'Human Sound', probably with higher confidence, for an audio segment labelled 'Human Speech', when 'Human Sound' is not among the listed labels?
 
There are some tricky issues with always assuming that the ancestors of a particular label also apply, because in some cases the parent node ends up as the ground-truth label precisely when the child node does not apply (e.g., Human Voice sounds that are not Human Speech). I've cced DAn in case he wants to elaborate on this.
Yes, I agree. One more tricky issue is that for some labels the ancestors can be ambiguous. For example, Snoring can fall under Human Sounds or under Animal Sounds. So I understand we cannot simply add ancestor labels as ground-truth labels to the evaluation set.

Regards,
Ravi Jain 

Manoj Plakal

Dec 8, 2017, 9:39:53 PM
to Ravi Jain, audioset-users
On Fri, Dec 8, 2017 at 9:22 PM, Ravi Jain <ravij...@gmail.com> wrote:


On Friday, 8 December 2017 22:48:50 UTC+5:30, Manoj Plakal wrote:

If I'm understanding correctly, you're asking if the existence of a ground truth label (e.g., Human Speech) also implies the existence of its ancestors in the ontology (e.g., Human Voice, Human Sound) as additional ground truth labels.
Yes. Also, if our model is trained well, it will predict the parent labels along with the 'Human Speech' label.
 
For evaluation purposes, we do not use any such implied ground truth labels. We use only the labels explicitly listed in the dataset. 
I interpreted the same from @DAn's reply. But wouldn't it drive my accuracy down if my model also predicts 'Human Sound', probably with higher confidence, for an audio segment labelled 'Human Speech', when 'Human Sound' is not among the listed labels?

Whether accuracy goes up or down depends on the metrics you care about :)

If you like the idea of predicting ancestors and the various associated caveats don't matter for your particular task, then you're free to define your own evaluation metric where you automatically fill in ancestors of existing labels as additional ground truth labels. 

If you'd like to reproduce our published results, then you should use the same metrics. 

Also, based on our experience, models will always predict Speech and Music higher than any other class due to the very large skew in the number of such labels for YouTube clips, so I don't think you should worry about your model not predicting Speech :)  It is in fact a problem induced by the dataset, so we try to make it not predict Speech as often as it does. This is also a case where evaluation metrics matter. If we went purely by Top-N_L accuracy on AudioSet, it would be easy to make a model that always predicts Speech and gets a high number, but if we wanted to use that model for real-world tasks, we would have issues.  Other metrics, such as mAP and d-prime/AUC, are less sensitive to this issue and are better suited to comparing models.



 
There are some tricky issues with always assuming that the ancestors of a particular label also apply, because in some cases the parent node ends up as the ground-truth label precisely when the child node does not apply (e.g., Human Voice sounds that are not Human Speech). I've cced DAn in case he wants to elaborate on this.
Yes, I agree. One more tricky issue is that for some labels the ancestors can be ambiguous. For example, Snoring can fall under Human Sounds or under Animal Sounds. So I understand we cannot simply add ancestor labels as ground-truth labels to the evaluation set.

Regards,
Ravi Jain 


Vinith Misra

Dec 22, 2017, 8:20:26 PM
to audioset-users
Hi Manoj,

Instead of creating another thread, let me piggyback on this one given the topical similarity.

I've been attempting for some time to replicate the validation set numbers quoted in the paper (.95 mean AUC, .3ish mAP) and am struggling to get anywhere *remotely* near there. I'm seeing a .75ish AP for Music (short of the reported .89), and a .01 (!) mean AP across all classes. Similarly, I'm seeing a .9 AUC for Music (treating it as binary classification), but a .58 AUC averaged across all classes.

The two explanations that come to mind are the architecture/hyperparameter choices I'm making, and my implementation of the mean AUC and mAP metrics, but I'm having trouble poking holes in either.

- Any chance you guys could share a specific architectural choice / optimizer / hyperparameter choice that you know can get in the ballpark of the quoted figures? The hints in this google group seem to suggest a shallow stack of 128-500 unit FC layers with ___ nonlinear activation, trained on the combined balanced/unbalanced train set using your favorite off-the-shelf optimizer. Just to rule this out as a problem, I'd be really grateful if you could let me know a more specific set of hyperparameters. (though I have trouble imagining that this is the issue for such a dramatic difference in numbers...)

- To compute AP for a given class, I look at the binary labels in the corresponding column over the eval-set examples (20k 10s segments) and their predictions, and feed them into sklearn.metrics.average_precision_score. I believe this sorts the eval-set examples by the scores for that class, computes the precision at every positive example, and averages. To compute mean AP, I average those numbers across all the classes (equal weighting). Similar story for AUC: sklearn.metrics.roc_auc_score followed by a simple unweighted mean across all classes.
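In code, the procedure I'm describing looks roughly like this (a sketch, not my exact script):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def mean_ap_auc(y_true, y_score):
    """Unweighted per-class means of AP and ROC AUC.
    y_true: (num_segments, num_classes) binary labels;
    y_score: matching array of classifier scores."""
    aps, aucs = [], []
    for c in range(y_true.shape[1]):
        col = y_true[:, c]
        if col.min() == col.max():   # AP/AUC undefined without both classes
            continue
        aps.append(average_precision_score(col, y_score[:, c]))
        aucs.append(roc_auc_score(col, y_score[:, c]))
    return float(np.mean(aps)), float(np.mean(aucs))

# Sanity check: a perfect scorer should give mAP = mean AUC = 1.0.
y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
print(mean_ap_auc(y, y.astype(float)))   # (1.0, 1.0)
```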

Again, I'd be super grateful for any help!

best,
-Vinith

Manoj Plakal

Dec 22, 2017, 10:51:48 PM
to Vinith Misra, Dan Ellis, audioset-users

Hi Vinith,

I'm on vacation right now so I'm not supposed to be checking my email :) 

I've cced DAn to help answer your questions while I'm gone, although I would expect slow replies given the coming holiday week.

One thing to note: the numbers you see in our papers use embeddings from the Resnet-50 architecture trained on a private dataset of 100M YouTube videos, so it's understandable to see lower numbers for VGGish embeddings, although your numbers for all-class means do seem low.  Perhaps you could share the architecture/hyperparameters and the code you use for training and eval?

Manoj




Vinith Misra

Dec 23, 2017, 3:04:31 AM
to Manoj Plakal, Dan Ellis, audioset-users
Hi Manoj (and Dan),

Thanks so much for the rapid response, and so sorry for pestering you guys at this time of the year! Please take your time in responding :).

After I sent the last email, I performed the simple experiment of independently fitting a logistic regression for each class, and the aforementioned problems disappear. If I attempt to jointly train an LR for all the classes at once using SGD, the class imbalance makes for dramatically different effective learning rates for the different classes' weight vectors (even with adaptive optimizers like AdaGrad that boost updates for rarely occurring weights). So this seems to suggest that one should weight each class's contribution to the loss.
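For concreteness, the per-class experiment is just something like the following (a sketch with made-up shapes; random 128-d features stand in for VGGish embeddings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up stand-ins: 200 "segments" with 128-d features (as if averaged
# VGGish embeddings) and 4 classes of multi-hot labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 128))
Y = (rng.random((200, 4)) < 0.3).astype(int)

# One independent logistic regression per class: each class's weights are
# fit on their own problem, so label skew in one class can't starve another.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
probs = clf.predict_proba(X)   # (num_segments, num_classes) per-class probabilities
print(probs.shape)             # (200, 4)
```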

Do you recall if you folks had any explicit class-balancing machinery at work, beyond the balanced train set? Or were there architectural/optimization choices that made that unnecessary?


Thanks!

Vinith


Dan Ellis

Dec 23, 2017, 9:30:33 PM
to Vinith Misra, Manoj Plakal, audioset-users
Yes, we've had to take steps to handle class imbalance.  We will (most likely) describe them in our next publication on the topic.  Until then, I'm leery of discussing details that we don't have explicit permission to release.

  DAn.



Vinith Misra

Dec 28, 2017, 9:20:25 PM
to Dan Ellis, Manoj Plakal, audioset-users
Many thanks for confirming my suspicions, and totally understood re: sensitivity around unreleased details.

By weighting each class's contribution to the total cross-entropy loss by 0.5/class_frequency for positive samples and 0.5/(1 - class_frequency) for negative samples, I'm able to come within reasonable proximity of the published numbers.  I'd be curious to (eventually) learn what approach you folks took.
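In case it's useful to others, here's a NumPy sketch of that weighting (my own formulation, nothing official):

```python
import numpy as np

def balanced_bce(y_true, y_pred, class_freq, eps=1e-7):
    """Binary cross-entropy with per-class balancing as described above:
    positive terms weighted by 0.5 / class_frequency, negative terms by
    0.5 / (1 - class_frequency), so each class's positives and negatives
    contribute equally in expectation regardless of label skew."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    w_pos = 0.5 / class_freq               # shape (num_classes,)
    w_neg = 0.5 / (1.0 - class_freq)
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(loss.mean())

# A missed positive on a rare class (freq 0.01) costs 50x more than the
# same miss on a balanced class (freq 0.5).
print(balanced_bce(np.array([[1.0]]), np.array([[0.1]]), np.array([0.01])))
print(balanced_bce(np.array([[1.0]]), np.array([[0.1]]), np.array([0.5])))
```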

best wishes,
-Vinith



Dan Ellis

Dec 29, 2017, 9:17:39 AM
to Vinith Misra, Manoj Plakal, audioset-users
That sounds like a good plan.

Which published figures are you comparing to?  The numbers in our ICASSP paper are of course for a smaller set of classes and training set.  We are working on releasing actual evaluation code to enable directly-comparable numbers, but progress is slow - our apologies.

  DAn.

On Thu, Dec 28, 2017 at 9:20 PM, Vinith Misra <vin...@alum.mit.edu> wrote:
Many thanks for confirming my suspicions, and totally understood re: sensitivity around unreleased details.

By weighting each class's contribution to the total cross-entropy loss by 0.5/class_frequency for positive samples and 0.5/(1 - class_frequency) for negative samples, I'm able to come within reasonable proximity of the published numbers.  I'd be curious to (eventually) learn what approach you folks took.

best wishes,
-Vinith




Vinith Misra

Dec 29, 2017, 9:54:14 AM
to Dan Ellis, Manoj Plakal, audioset-users
No need to apologize --- thank you so much for releasing all that you have already, and for being so responsive to questions! I am indeed benchmarking against the ICASSP paper (485 classes).

Kong et al.'s recent arXiv upload (https://arxiv.org/abs/1711.00927) quotes similar numbers for 527 classes, even in the absence of balancing (they do experiment with a minibatch balancing scheme, but it yields only marginal improvement). I can only assume that they must also be "silently" performing balanced training.

best wishes,
-Vinith

On Fri, Dec 29, 2017 at 6:17 AM 'Dan Ellis' via audioset-users <audiose...@googlegroups.com> wrote:
That sounds like a good plan.

Which published figures are you comparing to?  The numbers in our ICASSP paper are of course for a smaller set of classes and training set.  We are working on releasing actual evaluation code to enable directly-comparable numbers, but progress is slow - our apologies.

  DAn.
On Thu, Dec 28, 2017 at 9:20 PM, Vinith Misra <vin...@alum.mit.edu> wrote:
Many thanks for confirming my suspicions, and totally understood re: sensitivity around unreleased details.

By weighting each class's contribution to the total cross-entropy loss by 0.5/class_frequency for positive samples and 0.5/(1 - class_frequency) for negative samples, I'm able to come within reasonable proximity of the published numbers.  I'd be curious to (eventually) learn what approach you folks took.

best wishes,
-Vinith


