How to use AVA ActiveSpeaker Dataset


lich...@gmail.com

Mar 16, 2019, 10:04:33 AM
to AVA Dataset Users
Hello, does anyone know how to use the newly released AVA ActiveSpeaker dataset?
I understand the meaning of each item in the annotation file, but I cannot figure out how to process the videos according to the annotations.

As an example from the annotation:
tghXjom3120,1621.23,0.405556,0.135417,0.648611,0.558333,NOT_SPEAKING,tghXjom3120_1620_1680:3
tghXjom3120,1621.28,0.405556,0.135417,0.65,0.5625,NOT_SPEAKING,tghXjom3120_1620_1680:3
tghXjom3120,1621.31,0.404167,0.135417,0.652778,0.56875,NOT_SPEAKING,tghXjom3120_1620_1680:3
tghXjom3120,1621.35,0.402778,0.135417,0.655556,0.577083,NOT_SPEAKING,tghXjom3120_1620_1680:3

where the second item (1621.23, ...) is the timestamp, but I don't know how to use this information when extracting frames.
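
For reference, a minimal sketch in Python of how rows in this format could be parsed; the field names below are my own reading of the columns (video ID, frame timestamp, normalized box corners, label, entity/track ID), not necessarily the official ones:

import csv

# Assumed column layout (illustrative names); the box coordinates appear to be
# normalized to [0, 1].
FIELDS = ["video_id", "frame_timestamp",
          "x1", "y1", "x2", "y2", "label", "entity_id"]

def load_annotations(path):
    """Parse annotation rows into dicts keyed by the assumed field names."""
    rows = []
    with open(path, newline="") as f:
        for raw in csv.reader(f):
            row = dict(zip(FIELDS, raw))
            row["frame_timestamp"] = float(row["frame_timestamp"])
            for key in ("x1", "y1", "x2", "y2"):
                row[key] = float(row[key])
            rows.append(row)
    return rows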

If I extract clip frames at 20 fps as the paper describes, the interval between frames should be 0.05 s, but in the annotation file it can be 0.03 s, 0.04 s, or 0.05 s. How should I preprocess the video according to the annotations? Can anyone give me some advice or point me to the dataset-parser code?

Thanks a lot!

Sourish Chaudhuri

Mar 17, 2019, 1:20:08 AM
to lich...@gmail.com, AVA Dataset Users
Hi,
     The timestamps in the released data correspond to frame timestamps in the full frame rate video, and depend on the frame rate in the corresponding video. 

     The use of 20 fps is only for the experiments section. We use this to have a constant number of frames per second across videos, since each video may have a different frame rate. If you start with full frame rate videos, and resample the video frames to obtain 20 fps, you should also adjust the labels appropriately.
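
As a rough illustration of that mapping, one way to pull the frame nearest to an annotation timestamp at the video's native frame rate is via OpenCV; this is only a sketch, not the official pipeline:

import cv2

def frame_at_timestamp(video_path, timestamp_s):
    """Return the decoded frame closest to timestamp_s at the native fps."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)        # native frame rate of this video
    frame_idx = int(round(timestamp_s * fps))
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

If you then resample the frames to 20 fps, the label timestamps would need the corresponding adjustment, as noted above.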

Thanks,
 Sourish


cherryli

Mar 17, 2019, 11:25:27 PM
to AVA Dataset Users
Thanks for your reply! I really appreciate your work, and thanks to all of you.
By the way, I am wondering when the code for the baseline model, or the trained model, will be released. I have recently been doing research on active speaker detection in a specific setting; since our own dataset is rather small, being able to use the baseline model as a pretrained model would be a big help.

Thanks,
Cherry

cherryli

Mar 20, 2019, 3:27:15 AM
to AVA Dataset Users
Hi,
    I have another question about using the dataset. When more than one person appears on screen, does each training example contain a sequence of face images from the same person? Is it right that each person's face track over 3 seconds forms one training example?
    You say you use 20 fps in the experiments; what if a certain person appears in fewer than 20 frames in a second? For example, if person A appears in 25 frames in one second, it is easy to extract 20 frames for A, but if person B appears in only 15 frames, should I pad with 5 zero frames at the end?
    I am very eager for your reply!

Thanks,
Cherry

Joseph Roth

Mar 20, 2019, 8:23:41 AM
to cherryli, AVA Dataset Users
If there are multiple people appearing in the same frame, there will be multiple entries with the same timestamp but with different entity_ids. You should use the entity_id to combine the bounding boxes over time to form a track.

There is a difference between the frame rate of a video and the number of frames in which a face track appears. For low-frame-rate videos, we performed upsampling by cloning/copying the most recent frame instead of padding with zero frames.
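
A minimal sketch of both steps, building on the parsing sketch earlier in the thread: group rows by entity_id into tracks, then resample a track onto a fixed 20 fps grid by repeating the most recent annotation (no zero padding). The helper names are illustrative:

from collections import defaultdict

def build_tracks(rows):
    """Group annotation rows (dicts from load_annotations) into face tracks
    keyed by entity_id, sorted by timestamp."""
    tracks = defaultdict(list)
    for row in rows:
        tracks[row["entity_id"]].append(row)
    for track in tracks.values():
        track.sort(key=lambda r: r["frame_timestamp"])
    return tracks

def resample_track(track, fps=20.0):
    """Resample a sorted track onto a fixed-fps time grid, repeating the most
    recent annotation when no newer one is available."""
    start = track[0]["frame_timestamp"]
    end = track[-1]["frame_timestamp"]
    resampled, i, t = [], 0, start
    while t <= end + 1e-6:
        while i + 1 < len(track) and track[i + 1]["frame_timestamp"] <= t:
            i += 1
        resampled.append(track[i])   # most recent annotation at or before t
        t += 1.0 / fps
    return resampled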

cherryli

Mar 28, 2019, 10:14:33 AM
to AVA Dataset Users
  Thanks for your reply ^^

  I have generated the training and testing samples in the way you described.
  For the visual features, I combined the face tracks within each 3-second clip for each face ID. For the audio features, I selected the corresponding audio clip and extracted MFCC features from the first 0.5 s, as described in the paper ("The Mel-spectrogram input to the audio network is 64*48*1 and is computed over the preceding 0.5 seconds of audio"). I wonder whether you normalize the MFCC features by scaling them to (0, 1)? Are there any problems with my preprocessing of the dataset?

  The other problem is that when I feed the training samples to a 2D CNN or 3D CNN based on VGGNet16, the network doesn't converge. Do you have any advice or tricks for training?

Thanks,
Cherry


Sourish Chaudhuri

Mar 28, 2019, 2:14:48 PM
to cherryli, AVA Dataset Users
Hi Cherry,
                It sounds like you're roughly in the right place, although I can't totally tell from your description, so a few clarifications based on your email below:

- We associate each visual bounding box with the previous 0.5 seconds of audio.
- For the audio representation, we use the Mel spectrum with 64 channels, not the MFCC.
- We use linear magnitude for the Mel spectrum. 
(I wouldn't expect using the MFCC to result in a significant difference from the numbers we've reported, although using 64 coefficients feels like it would be more than necessary.)
- The feature computation uses 25 ms windows with 10 ms hopsize, and we use sufficient previous context for the total duration to be ~0.5 seconds.
- We don't do any additional normalization.
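
Based on those settings, a sketch of the audio feature computation with librosa, assuming 16 kHz audio (the sample rate and library choice are my assumptions, not the released training code): 25 ms windows with a 10 ms hop over 0.5 s of 16 kHz audio yield 48 frames, so the output is 64x48 as quoted from the paper.

import librosa
import numpy as np

SR = 16000                # assumed sample rate
WIN = int(0.025 * SR)     # 25 ms window -> 400 samples
HOP = int(0.010 * SR)     # 10 ms hop    -> 160 samples

def mel_features(wav, end_sample, context_s=0.5):
    """Linear-magnitude 64-band mel spectrogram over the preceding 0.5 s of
    audio ending at end_sample; no additional normalization."""
    start = max(0, end_sample - int(context_s * SR))
    segment = wav[start:end_sample]
    mel = librosa.feature.melspectrogram(
        y=segment, sr=SR, n_fft=WIN, win_length=WIN, hop_length=HOP,
        n_mels=64, power=1.0, center=False)  # power=1.0 -> linear magnitude
    return mel.astype(np.float32)            # shape: (64, 48) for 0.5 s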

                Separately, on the training portion, I assume the network convergence question is independent. I can't think of any obvious reasons why that would be except to suggest trying out different hyperparameter values for training: batch size, learning rate, network size, etc. Perhaps others on the list can chime in with their experiences here, too.

                Hope this helps!

Thanks,
 Sourish

chuyi li

Mar 28, 2019, 11:15:57 PM
to Sourish Chaudhuri, AVA Dataset Users
Hi Sourish,
    Thanks for your explanation. I will try what you advised.


Thanks,
Cherry


Ritvik Agrawal

Mar 2, 2022, 12:45:14 PM
to AVA Dataset Users
Hi !

I'm facing a similar issue to the one mentioned above, i.e. how to choose the frame rate and synchronize the corresponding audio and video.
Could you please verify whether the following script is correct for the data extraction/preprocessing part?

Regards
Ritvik Agrawal

so...@google.com

Mar 2, 2022, 2:33:41 PM
to AVA Dataset Users
Hi Ritvik,
               Different approaches, based on different modeling assumptions, have been used by different efforts in deciding what an optimal frame rate for modeling is. We've used 20 fps for modeling in our experiments, and at training time, the model looks at the previous K frames (where K is a parameter) and the previous 0.5 seconds of audio ending at the timestamp corresponding to the latest image frame in the stack of K frames. 
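
A sketch of how those two inputs could be sliced for one training example, given a 20 fps frame stack and the waveform (K, the sample rate, and the helper name are placeholders, not the released code):

def clip_inputs(frame_times, frames, wav, idx, k=5, sr=16000):
    """Previous k video frames ending at frame index idx, plus the 0.5 s of
    audio ending at that frame's timestamp."""
    image_stack = frames[max(0, idx - k + 1): idx + 1]
    end_sample = int(frame_times[idx] * sr)
    audio = wav[max(0, end_sample - sr // 2): end_sample]
    return image_stack, audio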

              Regarding the codebase you pointed to, Okan Kopuklu is probably the right person to reach out to if you need clarifications.

Thanks,
 Sourish
