Hi everyone,
Many thanks to the AudioSet team for releasing temporally strong labels for a large amount of data.
I have a question about how to make full use of these strong labels for the audio classification task: the strong labels are released with a temporal resolution of approximately 0.1 sec, whereas the AudioSet embeddings are computed on a 960 ms (roughly 1 sec) grid.
As I understand it, audio classification with the current AudioSet embeddings, of shape (N, 10, 128), offers the following training setups:
- Weakly labelled training with a label resolution of 10 sec
- Strongly labelled training with a label resolution of 1 sec
This means that, although the strong labels have a temporal resolution of 0.1 sec, we cannot exploit that resolution in training unless the embeddings are recomputed at 100 ms resolution, i.e. with shape (N, 100, 128).
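To make the mismatch concrete, here is a minimal sketch of how I imagine strong-label events being mapped onto a frame grid. It is not taken from the paper: the function name `rasterize`, the (start, end, class index) event format, the example events, and the approximation of the 960 ms grid as a 1 sec hop are all my own assumptions for illustration.

```python
import numpy as np

def rasterize(events, num_classes, clip_ms=10_000, hop_ms=1_000):
    """Turn (start_sec, end_sec, class_index) events into frame-level targets.

    Hypothetical helper: a frame is marked active if the event overlaps it at all.
    Times are converted to integer milliseconds to avoid floating-point edge cases.
    """
    n_frames = clip_ms // hop_ms
    target = np.zeros((n_frames, num_classes), dtype=np.float32)
    for start, end, cls in events:
        start_ms = int(round(start * 1000))
        end_ms = int(round(end * 1000))
        first = max(start_ms // hop_ms, 0)
        last = min(-(-end_ms // hop_ms), n_frames)  # ceiling division
        target[first:last, cls] = 1.0
    return target

# Made-up events for one 10 sec clip: a short event (class 0) and a long one (class 1).
events = [(2.3, 2.6, 0), (4.0, 7.5, 1)]

coarse = rasterize(events, num_classes=2, hop_ms=1_000)  # matches (10, 128) embeddings
fine = rasterize(events, num_classes=2, hop_ms=100)      # would need (100, 128) embeddings
weak = coarse.max(axis=0)                                # clip-level (10 sec) label

print(coarse.shape, fine.shape, weak)
```

On the 1 sec grid the 0.3 sec event fills a whole frame, so its onset and offset are lost, while the 100 ms targets keep them but have no matching embedding frames to train against.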
Could someone please comment on this? Please correct me if I have got this wrong or missed anything in the paper or in any follow-up updates.
Arjun
PhD scholar
C4DM, QMUL