Questions on the baseline model for AVA

Alain Raymond

Jun 28, 2019, 10:22:25 AM
to AVA Dataset Users
Hello, I'm trying to replicate the AVA baseline (for which you provided a model checkpoint) in PyTorch, but I'm getting extremely poor results. My baseline almost always predicts the Watch (a person) class (which I can understand, since it's the most common one).

Thus I wanted to ask you a few questions about it, if it's okay:

Assumptions
  • Currently working with AVA 2.2
  • I'm only interested in classifying actions, not bbox localization.

Baseline:
  • I see from the model checkpoint that it consists of a Faster R-CNN coupled with a ResNet-101, where the average-pooled features from the 4th block (a 2048-dimensional vector) are passed through a linear layer.
  • The model only takes one frame. Is this accurate? Am I missing an activation function, a nonlinearity, or another linear layer? Since I'm only interested in classification, I crop the image to the ground-truth box and pass the crop to the ResNet (sketched after this list). Is that the same as what you do?
  • Was the ResNet used for the baseline model kept frozen during training, or was it fine-tuned?
  • I'll be using a ResNet-50. I expect results to be somewhat worse than with a ResNet-101, but not by much. Or do you think this could be an issue? Did you by any chance try it?
  • Did you use any learning rate scheduling? Warmup or decay? Reduction based on validation loss?
  • Did you do anything special to combat the class imbalance in the dataset for the baseline? (like weighted losses or something?)
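
For concreteness, my classification-only pipeline looks roughly like the sketch below (PyTorch; the ResNet-50 backbone, 224x224 crops and the BCE loss are my own choices, not read off your checkpoint):

import torch
import torch.nn as nn
import torchvision

class BoxActionClassifier(nn.Module):
    """Crop the ground-truth person box, run it through a ResNet, classify with one linear layer."""
    def __init__(self, num_classes=80):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        # Everything up to and including the global average pool -> 2048-d feature vector.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Linear(2048, num_classes)

    def forward(self, crops):
        # crops: (B, 3, 224, 224) person boxes cropped from the key frame
        feats = self.features(crops).flatten(1)
        return self.head(feats)  # raw logits, one per class

model = BoxActionClassifier()
criterion = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy for the multi-hot targets
logits = model(torch.randn(2, 3, 224, 224))
loss = criterion(logits, torch.zeros(2, 80))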

Dataset

To create my samples from the dataset I'm doing the following:
  • Aggregating information from the provided csv by:
    •  Video id
    • Second
    • Person Id
    • The target is a multi-hot (bag-of-words) vector of actions, e.g. [1 0 0 0 1 0 0 0 ...], for that person, second, and video (see the sketch after this list).
    • Bbox for that person_id, second, video_id
  • I'm getting a training sample count of around 329,000, whereas the paper reports 210,634 training samples. That's roughly 50% more than you have, even accounting for the 2.8% extra samples that v2.2 provides over v2.1.
  • Do you do anything else to filter your samples? Or do you perhaps group all persons at a given timestamp into a single sample, since you are doing bbox localization?
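
Concretely, the aggregation I run is roughly the following (pandas sketch; the column names are my own, assigned to the header-less csv):

import numpy as np
import pandas as pd

NUM_CLASSES = 80
cols = ["video_id", "timestamp", "x1", "y1", "x2", "y2", "action_id", "person_id"]
df = pd.read_csv("ava_train_v2.2.csv", names=cols)

samples = []
for (vid, sec, pid), grp in df.groupby(["video_id", "timestamp", "person_id"]):
    target = np.zeros(NUM_CLASSES, dtype=np.float32)
    target[grp["action_id"].values - 1] = 1.0            # action ids are 1-based
    bbox = grp[["x1", "y1", "x2", "y2"]].iloc[0].values  # the box is repeated on every action row
    samples.append((vid, sec, pid, bbox, target))

print(len(samples))  # this is where I get ~329,000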

Thank you very much in advance!

Best regards,

Alain Raymond

David Ross

Jun 28, 2019, 10:55:02 AM
to Alain Raymond, AVA Dataset Users
Hi Alain,

Dataset

That is the correct number of samples. Aggregating by video, second, and bounding box:

$ cat ava_train_v2.2.csv | cut -d, -f 1-6 | uniq | wc -l
332353

In the paper (section 6.1) "we use classes that have at least 25 instances in validation and test splits to benchmark performance".

For the AVA Challenge, we've selected 60 of the 80 classes, listed here http://research.google.com/ava/download/ava_action_list_v2.2_for_activitynet_2019.pbtxt. I believe that people who report numbers on AVA focus on those 60 classes.
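
If you want to restrict a csv to that subset, a rough sketch (it simply pulls the label_id fields out of the pbtxt with a regex; the column names are illustrative, matching the header-less AVA csv layout):

import re
import pandas as pd

cols = ["video_id", "timestamp", "x1", "y1", "x2", "y2", "action_id", "person_id"]
df = pd.read_csv("ava_train_v2.2.csv", names=cols)

with open("ava_action_list_v2.2_for_activitynet_2019.pbtxt") as f:
    challenge_ids = {int(i) for i in re.findall(r"label_id:\s*(\d+)", f.read())}

df_60 = df[df["action_id"].isin(challenge_ids)]  # keep only the 60 challenge classes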

Regards,
David

Chen Sun

Jun 28, 2019, 5:34:29 PM
to Alain Raymond, AVA Dataset Users
Hi Alain,

Please find answers about the baseline model:

On Fri, Jun 28, 2019 at 7:22 AM Alain Raymond <alai...@gmail.com> wrote:
Hello, I'm trying to replicate the AVA baseline (for which you provided a model checkpoint) in PyTorch, but I'm getting extremely poor results. My baseline almost always predicts the Watch (a person) class (which I can understand, since it's the most common one).

Thus I wanted to ask you a few questions about it, if it's okay:

Assumptions
  • Currently working with AVA 2.2
  • I'm only interested in classifying actions, not bbox localization.

Baseline:
  • I see from the model checkpoint that it consists of a Faster R-CNN coupled with a ResNet-101, where the average-pooled features from the 4th block (a 2048-dimensional vector) are passed through a linear layer.
  • The model only takes one frame. Is this accurate? Am I missing an activation function, a nonlinearity, or another linear layer? Since I'm only interested in classification, I crop the image to the ground-truth box and pass the crop to the ResNet. Is that the same as what you do?
The baseline is just a standard Faster RCNN model, trained for action detection by replacing the person label with the action labels.
There are a few catches:
- Unlike Pascal or COCO, the labels in AVA are not mutually exclusive, you would need to handle this in your loss function (e.g. sigmoid cross-entropy instead of softmax) 
- Since it's multi-label, standard classification accuracy can't be used directly; instead we report per-class average precision (a rough sketch follows this list). It's possible the prediction scores are uncalibrated and biased towards the popular classes (although it's unlikely that a single class always fires), but the AP metric doesn't suffer from this, since it ranks predictions within each class.
- The purpose of this baseline is to give a performance lower bound. Recent papers (and the I3D baseline in the AVA dataset paper) have shown that incorporating temporal context (via 3D ConvNets) is very helpful.
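
As a rough sketch of that metric at the classification level only (the official evaluation additionally matches predicted boxes to ground-truth boxes before scoring):

import numpy as np
from sklearn.metrics import average_precision_score

def mean_per_class_ap(scores, targets):
    # scores, targets: arrays of shape (num_samples, num_classes); targets are multi-hot 0/1
    aps = []
    for c in range(targets.shape[1]):
        if targets[:, c].any():  # skip classes with no positives in this split
            aps.append(average_precision_score(targets[:, c], scores[:, c]))
    return float(np.mean(aps))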
  • Was the ResNet used for the baseline model kept frozen during training, or was it fine-tuned?
It was pre-trained on ImageNet then fine-tuned. 
  • I'll be using a ResNet-50. I expect results to be somewhat worse than with a ResNet-101, but not by much. Or do you think this could be an issue? Did you by any chance try it?
It should be just a bit worse. 
  • Did you use any learning rate scheduling? Warmup or decay? Reduction based on validation loss?
We are now using linear warmup then cosine learning rate decay.
Please note that we found our baseline models underfit on the training set, and training for longer (with a better LR schedule) leads to higher performance.
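For example, the schedule can be written as a simple per-step multiplier (a sketch; the warmup length, total steps and base LR below are placeholders, not our exact values):

import math

def warmup_cosine(step, warmup_steps=2000, total_steps=200000):
    # Linear warmup from 0 to 1, then cosine decay from 1 back to 0.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# With a PyTorch optimizer, e.g.:
#   optimizer = torch.optim.SGD(model.parameters(), lr=0.04, momentum=0.9)
#   scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
#   ... call scheduler.step() once per training step ...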
  • Did you do anything special to combat the class imbalance in the dataset for the baseline? (like weighted losses or something?)
No. 

Alain Raymond

Jun 28, 2019, 6:22:58 PM
to AVA Dataset Users
Hi David & Chen!

Thank you so much for your replies!

I have a few followup questions:

Regarding this:
  • We are now using linear warmup then cosine learning rate decay.
    Please note that we found our baseline models underfit on the training set, and training for longer (with a better LR schedule) leads to higher performance.
I was checking the pipeline.config file and found this:

optimizer {
  momentum_optimizer {
    learning_rate {
      manual_step_learning_rate {
        initial_learning_rate: 0.000300000014249
        schedule {
          step: 1200000
          learning_rate: 2.99999992421e-05
        }
      }
    }
    momentum_optimizer_value: 0.899999976158
  }
}

This suggests a constant learning rate of 3e-4 until step 1,200,000, after which it drops to 3e-5.

Am I incorrect?

Regarding the question about doing anything to handle class imbalance, you said you didn't do anything special.

I was checking the pipeline.config file and found this:

second_stage_classification_loss {
  weighted_sigmoid {
    anchorwise_output: true
  }
}

I'm a bit confused: did you use any weights for this weighted sigmoid?

I'm sorry to be so particular about this.

Thank you so much!

Best regards,

Alain Raymond

Chen Sun

Jun 28, 2019, 6:28:39 PM
to Alain Raymond, AVA Dataset Users
Hi Alain,

Yes, the baseline used a step LR decay, but we found that a cosine schedule + warmup works better (you may want to set the LR higher than the value in that config).
The weighted_sigmoid is, IIUC, just a naming convention; in practice the classes all have equal weights.

Chen

Alain Raymond

Jun 28, 2019, 6:32:15 PM
to AVA Dataset Users
Ok, thank you so much!

Best regards,

Alain

Alain Raymond

Jun 29, 2019, 2:34:50 PM
to AVA Dataset Users
Hello Chen!

Another question: any particular reason to use a batch size of 1 for the baseline?

Best regards,

Alain Raymond

Chen Sun

Jun 29, 2019, 2:36:33 PM
to Alain Raymond, AVA Dataset Users
Hi Alain,

I think the API didn't support batch sizes larger than 1 at the time; please feel free to use a larger batch size if it fits in your GPU memory.

chen

Alain Raymond

Jul 1, 2019, 3:18:25 PM
to AVA Dataset Users
Thank you so much Chen for the reply!

Another question: what threshold did you use to decide whether a class is predicted as positive? A logit above 0 (i.e. a sigmoid output above 0.5)? Or did you use another value?

Best regards,

Alain

Aarti Malhotra

May 12, 2021, 8:08:43 PM
to AVA Dataset Users
Hi,

I simply need to use an AVA 2.2 pretrained model to predict person-to-person interactions. Has anyone had success running one on a custom video clip?
I am trying the SlowFast code, but I can't seem to get it running on my local machine.

Thanks
Aarti

SHAHARYAR AMJAD

Jan 20, 2022, 6:27:33 AM
to AVA Dataset Users
Hello, can I get source code for the AVA dataset that sorts a video dataset into the action classes?