Writing here as the original author of the AudioSet ontology ...
As you might notice, the version of the ontology published on GitHub hasn't been updated since its initial release in 2017. Our internal version has gone through a large number of small changes, including adding "Slurping" (/m/07pqmly) (though not "Sipping") and "Washing machine" (/m/0174k2).
"Clicking" (/m/07qc9xj) is already present, but we don't have a subclass for mouse clicking.
Having these within the ontology is not the same as having adequate examples for them, of course (or including them in a published classifier).
If you want to identify which existing AudioSet or YAMNet classes best correspond, one (slightly circular) approach is simply to see what the classifier reports for examples of the new classes. I get "Chewing" for drinking, a lot of "Liquid" sounds for washing machines (and some "Train", depending on what the machine is doing, I guess), and "Computer keyboard" for mouse clicks. (These are for random samples pulled from the internet, not the ESC-50 sounds specifically, which I note are licensed non-commercially.)
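As a sketch of that kind of check, assuming you already have per-frame classifier scores (e.g. YAMNet emits a `(frames, 521)` array plus a class-name list), averaging over time and reading off the top classes is enough to see where a new sound lands. The names and toy numbers below are illustrative, not real model output:

```python
import numpy as np

def top_classes(scores, class_names, k=5):
    """Average per-frame scores over time and return the k highest-scoring
    (class_name, mean_score) pairs -- a quick way to see which existing
    classes a clip of some new sound type maps onto."""
    mean_scores = scores.mean(axis=0)          # (num_classes,)
    top = np.argsort(mean_scores)[::-1][:k]    # indices of the k largest
    return [(class_names[i], float(mean_scores[i])) for i in top]

# Toy demonstration with made-up scores for three classes over two frames.
names = ["Chewing", "Liquid", "Computer keyboard"]
fake_scores = np.array([[0.1, 0.7, 0.2],
                        [0.2, 0.6, 0.1]])
print(top_classes(fake_scores, names, k=2))
```

With real audio you'd replace `fake_scores` with the score array the model returns for your clip.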
I admit that I'm unclear about the best role for the ontology on GitHub. I guess I meant the whole thing as a proposal, and by putting it on GitHub, I meant to indicate that we're receptive to other input about how it should be. However, in practice we've now diverged internally, and a separate evolution externally isn't terribly appealing. And, to be honest, I'm less convinced that striving for a single, universal audio event ontology is an achievable goal. My experience is that even with the classes we defined, there are almost always application-specific wrinkles that undermine the appearance of authority.
Happy to discuss further, though.
To your point, I'm not aware of an existing mapping between ESC-50 classes and AudioSet MIDs, but it seems like a nice idea. You might want to share whatever you end up with.
Rather than directly using AudioSet classifier outputs to detect ESC-50 classes, the more common style of work, I believe, is to train some kind of embedding layer on audioset data (or take the embedding from an existing classifier), then evaluate on ESC-50 by using some of the data to train a final classification layer (and evaluate on the rest).
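That recipe (freeze an embedding trained on AudioSet, fit a small classification layer on some ESC-50 folds, evaluate on the held-out fold) might be sketched like this. Synthetic vectors stand in for real embeddings here, and the 5-fold split just mimics ESC-50's predefined folds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for frozen audio embeddings of ESC-50 clips: two synthetic
# classes with shifted means, mimicking separable embedding clusters.
n_per_class, dim = 100, 64
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, dim)),
               rng.normal(1.0, 1.0, (n_per_class, dim))])
y = np.repeat([0, 1], n_per_class)

# ESC-50 ships with 5 predefined folds; fake a fold assignment here by
# cycling fold ids, then hold out fold 0 for evaluation.
fold = np.tile(np.arange(5), len(y) // 5)
train, test = fold != 0, fold == 0

# The "final classification layer": a linear classifier on the frozen
# embeddings, trained only on the training folds.
clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
acc = clf.score(X[test], y[test])
print(f"held-out fold accuracy: {acc:.2f}")
```

In a real run you'd average the model's per-frame embeddings into one vector per clip and use the fold column from the ESC-50 metadata instead of the fabricated split.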
Best,
DAn.