Active Speaker baseline model


baptiste...@gmail.com

Aug 11, 2020, 5:05:36 AM
to AVA Dataset Users
Hello,

I'm trying to reproduce the paper's baseline, but I'm struggling to achieve comparable results for the video-only model (V) using a GRU. Can you please confirm the following points?
  • The embedding model takes 60 frames at a time as input. Its first layer is a 3D conv that stacks M frames together (kernel = (3x3xM) with a stride of (2,2,1)).
  • The GRUs yield predictions in a "many-to-many" fashion. Therefore, the 3-second (60-frame) input leads to 60 predictions at a time (which is why I guess the stride is 1 for the third dimension in the above bullet).
  • The model is trained with a learning rate of 2^-6 as stated in the paper (and not 2e-6).
Thanks,

Baptiste

Joseph Roth

Aug 11, 2020, 10:58:06 AM
to baptiste...@gmail.com, AVA Dataset Users
Baptiste,

I hope this picture of a 3-frame GRU model can help answer some of your questions.
[attached image: diagram of a 3-frame GRU model]

  • The embedding model takes 60 frames at a time as input. Its first layer is a 3D conv that stacks M frames together (kernel = (3x3xM) with a stride of (2,2,1)).
For the visual-only GRU model we experimented with anywhere from 1 to 5 input frames, but found that you don't need more than two. The first layer is not a 3D conv, but rather a 2D conv.
  • The GRUs yield predictions in a "many-to-many" fashion. Therefore, the 3-second (60-frame) input leads to 60 predictions at a time (which is why I guess the stride is 1 for the third dimension in the above bullet).
The GRU predicts a single output for each timestep. The internal state from the prior timestep is passed along when predicting the next timestep.
  • The model is trained with a learning rate of 2^-6 as stated in the paper (and not 2e-6).
The learning rate is 2^-6, i.e. 0.015625 (not 2e-6 = 0.000002).
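
For concreteness, here is a minimal PyTorch sketch of that pipeline. This is not our actual implementation; the layer sizes, the two-frame channel-stacked input, and all names are illustrative assumptions. It only shows the shape of the idea: a 2D-conv embedding per timestep, a GRU carrying state across timesteps, and one prediction per timestep.

import torch
import torch.nn as nn

class VisualGRUBaseline(nn.Module):
    """Illustrative sketch of a visual-only active-speaker model:
    a 2D-conv embedding over channel-stacked frames, then a GRU
    that emits one prediction per timestep. Layer sizes are guesses."""

    def __init__(self, stacked_frames=2, embed_dim=128, hidden_dim=128):
        super().__init__()
        # 2D conv: the M stacked frames live in the channel dimension.
        self.embed = nn.Sequential(
            nn.Conv2d(stacked_frames, 32, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)  # speaking / not-speaking logit

    def forward(self, clips):
        # clips: (batch, time, stacked_frames, height, width)
        b, t = clips.shape[:2]
        feats = self.embed(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.gru(feats)  # hidden state carries across timesteps
        return self.fc(out)       # one logit per timestep: (b, t, 1)

model = VisualGRUBaseline()
optimizer = torch.optim.SGD(model.parameters(), lr=2 ** -6)  # 0.015625, not 2e-6
logits = model(torch.randn(4, 60, 2, 64, 64))  # 60 timesteps -> 60 predictions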



baptiste...@gmail.com

Aug 11, 2020, 12:18:28 PM
to AVA Dataset Users
Thank you for your prompt answer, it helps a lot!

If I understood correctly: given the first 3 frames (Frames 1, 2, and 3 in your picture), the last output of the GRU (the most recent hidden state) is connected to the FC and you get a prediction for Frame 1's label.
Then you slide by one frame and get the prediction for Frame 2 by feeding frames 2, 3, and 4 to the model, and so on.

Is that right?

Thanks, 

Baptiste

Joseph Roth

Aug 11, 2020, 12:25:50 PM
to baptiste...@gmail.com, AVA Dataset Users
You can make the prediction for frame 1, 2, or 3 in that case.  We didn't notice a substantial difference in which frame you choose for the label when training.  The advantage of picking the last frame is that you are always making a prediction for the current time instant instead of having a delay and predicting the past.

This means that if you are predicting an N-frame sequence with an M-frame classifier, you only get N - M + 1 predictions if you don't pad.  You can either pad the inputs with zero frames or pad the outputs by cloning the decisions at the start/end of the sequence.  In practice we chose the latter approach.
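
For concreteness, a small NumPy sketch of that bookkeeping. The classify_window callable and the window-averaging toy are illustrative assumptions; this version labels each window with its last frame, so only the start of the sequence needs the cloned decisions.

import numpy as np

def predict_sequence(frames, classify_window, m):
    """Slide an m-frame classifier over n frames: yields n - m + 1 raw
    predictions, then clones the edge decision so every frame gets one."""
    n = len(frames)
    preds = np.array([classify_window(frames[i:i + m]) for i in range(n - m + 1)])
    # Labeling each window with its *last* frame means the first m - 1
    # frames have no prediction yet; clone the first decision back to them.
    return np.concatenate([np.full(m - 1, preds[0]), preds])

# Toy example: a "classifier" that averages a 3-frame window.
frames = np.arange(10, dtype=float)
out = predict_sequence(frames, classify_window=np.mean, m=3)
assert len(out) == len(frames)  # 10 - 3 + 1 = 8 raw predictions, padded to 10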

baptiste...@gmail.com

Aug 12, 2020, 3:48:50 AM
to AVA Dataset Users
Thanks a lot!

So the 3-second window (60 frames) mentioned in the article defines the length of the sequence after which each backward pass is performed?

Thanks,

Baptiste