--
You received this message because you are subscribed to the Google Groups "AVA Dataset Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ava-dataset-us...@googlegroups.com.
To post to this group, send email to ava-data...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ava-dataset-users/1aa3af2c-b0a1-4efb-b101-c5de6a955fd7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hello,

I'm trying to replicate the AVA baseline (for which you provided a model checkpoint) in PyTorch, but I'm getting extremely poor results: my model almost always predicts the Watch (a person) class (which I can understand, since it's the most common one). I therefore wanted to ask you a few questions, if that's okay.

Assumptions:
- Currently working with AVA 2.2
- I'm only interested in classifying actions, not bbox localization.
Baseline:
- From the model checkpoint I see that it consists of a Faster R-CNN coupled with a ResNet-101, where the average-pooled features from the 4th block (a 2048-dimensional vector) are passed through a single linear layer.
- The model takes only one frame. Is this accurate? Am I missing an activation function, nonlinearity, or another linear layer? Since I'm only interested in classification, I crop the image according to the ground-truth box and pass that crop to the ResNet. Is this the same as what you do?
- Was the ResNet in the baseline model frozen during training, or fine-tuned?
- I'll be using a ResNet-50. I expect results to be somewhat poorer, but not far off from ResNet-101. Or do you think this could be an issue? Did you by any chance try it?
- Did you use any learning rate scheduling? Warmup or decay? Reduction based on validation loss?
- Did you do anything special to combat the class imbalance in the dataset for the baseline? (like weighted losses or something?)
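To make sure we're talking about the same thing, here is a minimal sketch of the head as I currently understand it (the names are mine, and `backbone` output stands in for the ResNet trunk up to block 4; this is my reading of the checkpoint, not necessarily your implementation):

```python
import torch
import torch.nn as nn

# Sketch of my understanding of the classification head: global average
# pool over block-4 features, then one linear layer -- nothing else.
class ActionHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=80):
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d(1)          # average-pool block-4 features
        self.fc = nn.Linear(feat_dim, num_classes)      # single linear layer, no nonlinearity

    def forward(self, block4_feats):
        # block4_feats: (B, 2048, h, w) features for a ground-truth person crop
        x = self.avgpool(block4_feats).flatten(1)       # (B, 2048)
        return self.fc(x)                               # raw logits

head = ActionHead()
logits = head(torch.randn(2, 2048, 7, 7))
print(logits.shape)  # torch.Size([2, 80])
```

I train this with `BCEWithLogitsLoss`, treating it as multi-label classification. Please let me know if any of these assumptions are wrong.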
Dataset:
To create my samples from the dataset, I'm doing the following:
- Aggregating information from the provided csv by:
- Video id
- Second
- Person Id
- The target is a multi-hot vector of actions, i.e. [1 0 0 0 1 0 0 0 ...] for that second, that person, that video.
- The bbox for that (person_id, second, video_id).
- I'm getting a training sample count of around 329,000, while the paper reports 210,634 training samples. That's around 50% more than what you have, even accounting for the ~2.8% extra samples that v2.2 provides over v2.1.
- Do you do anything else to filter your samples? Do you perhaps put all persons into the same sample, since you are doing bbox localization?
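The aggregation above boils down to the following sketch (column names are my own assumption, since the csv ships without a header; the rows here are a toy stand-in for the real annotation file):

```python
import pandas as pd

NUM_CLASSES = 80  # AVA defines 80 action labels (ids 1..80)

# Assumed column layout for the annotation csv (my naming).
cols = ["video_id", "second", "x1", "y1", "x2", "y2", "action_id", "person_id"]
rows = [  # toy stand-in for the real training csv
    ["vid1", 902, 0.1, 0.2, 0.5, 0.9, 12, 0],
    ["vid1", 902, 0.1, 0.2, 0.5, 0.9, 80, 0],
    ["vid1", 903, 0.2, 0.2, 0.6, 0.9, 12, 0],
]
df = pd.DataFrame(rows, columns=cols)

# One sample per (video_id, second, person_id); actions become a multi-hot target.
samples = []
for (vid, sec, pid), grp in df.groupby(["video_id", "second", "person_id"]):
    target = [0] * NUM_CLASSES
    for action_id in grp["action_id"]:
        target[action_id - 1] = 1
    bbox = grp.iloc[0][["x1", "y1", "x2", "y2"]].tolist()
    samples.append({"video_id": vid, "second": sec, "person_id": pid,
                    "bbox": bbox, "target": target})

print(len(samples))  # 2 samples: (vid1, 902, 0) and (vid1, 903, 0)
```

With this grouping I end up with the ~329,000 count mentioned above, so I'd like to know which additional filtering step I'm missing.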
Thank you very much in advance!

Best regards,
Alain Raymond