Dear ActivityNet Participants,
The time has come! We are ready to make the official release of the 6th installment of the annual International Challenge on Activity Recognition, to be held in conjunction with CVPR 2021 on June 19, 2021. Building on the success of the previous ActivityNet Challenges (2016, 2017, 2018, 2019, 2020) and on your feedback, we have worked hard to make this round richer and more inclusive. We are proud to announce that this year's challenge will be a packed half-day workshop with parallel tracks, hosting 12 diverse challenges (16 tasks) that aim to push the limits of semantic visual understanding of videos and to bridge visual content with human captions.
Two out of the twelve challenges are based on the ActivityNet Dataset. These tasks focus on tracing evidence of activities in time in the form of class labels and captions. In this installment of the challenge, we will host ten guest challenges, which enrich the understanding of visual information in videos. These tasks focus on complementary aspects of the activity recognition problem at large scale and involve challenging and recently compiled datasets. Please see more information about each task appended to this message.
We encourage you to visit the challenge website and go through its details (e.g. task/dataset specifications, important dates, evaluation metrics, toolkits/baselines, and submission guidelines). We have designated one or more contact people for each task. Similar to last year, the ActivityNet Google Group can help you get ActivityNet Challenge and dataset-specific questions answered.
We are looking forward to your submissions and are committed to making your participation a pleasant experience. If you have any questions or comments about the challenge, please contact us.
Please share this email with any individuals who may be interested in the International Challenge on Activity Recognition (ActivityNet).
If you would no longer like to receive communications about the challenge, you may reply “unsubscribe” to this email.
Regards,
ActivityNet Challenge Team
~~~~~~~~~~~~~ Task Descriptions ~~~~~~~~~~~~~
Action Recognition
Kinetics-700 Supervised: The supervised track is similar to previous years, except for the evaluation. For both tracks, this year we will ask participants to upload one 512-d feature vector for each training and test video -- not class scores anymore. We will then train a linear classifier ourselves on top of these feature vectors to determine top-1 and top-5 accuracy and decide the winning model.
Kinetics-700 Self-supervised: For the self-supervised track, we will ask participants to train on videos from just half of the classes to test for out-of-domain generalization. Class labels should otherwise not be used for this track -- the goal is to learn representations without them! Participants in this track will be asked to upload feature vectors for videos from all classes in both the train and test splits (even for videos from classes the model is not trained on).
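For concreteness, here is a minimal Python sketch of the kind of evaluation described above for both Kinetics-700 tracks: fitting a linear classifier on submitted 512-d training features and scoring top-1/top-5 accuracy on the test features. The file names and the choice of logistic regression as the linear classifier are illustrative assumptions; this is not the official evaluation code.

    # Sketch only: fit a linear classifier on submitted features and
    # report top-1/top-5 accuracy. File names are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    train_feats = np.load("train_features.npy")   # shape (N_train, 512), hypothetical file
    train_labels = np.load("train_labels.npy")    # shape (N_train,)
    test_feats = np.load("test_features.npy")     # shape (N_test, 512)
    test_labels = np.load("test_labels.npy")      # shape (N_test,)

    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    probs = clf.predict_proba(test_feats)         # shape (N_test, n_classes)

    # Top-k accuracy: the true class must appear among the k highest-scoring classes.
    def top_k_accuracy(probs, labels, k):
        topk = np.argsort(probs, axis=1)[:, -k:]  # column indices of the k best classes
        classes = clf.classes_[topk]              # map column index -> class label
        return np.mean([labels[i] in classes[i] for i in range(len(labels))])

    print("top-1:", top_k_accuracy(probs, test_labels, 1))
    print("top-5:", top_k_accuracy(probs, test_labels, 5))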
TinyActions: In this challenge, the focus is on recognizing tiny actions in videos. The existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. Therefore, the available action recognition models are not designed for low-resolution videos and their performance is still far from satisfactory when the action is not distinctly visible. This challenge invites solutions for recognizing tiny actions in real-world videos.
ActivityNet Temporal Action Localization: Despite the recent advances in large-scale video analysis, temporal action localization remains one of the most challenging unsolved problems in computer vision. The lack of robust solutions to this search problem hinders various real-world applications, ranging from consumer video summarization to surveillance, crowd monitoring, and elderly care. This task is intended to encourage computer vision researchers to design high-performance action localization systems.
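Temporal localization tasks of this kind are commonly scored with mean average precision at several temporal IoU (tIoU) thresholds; the sketch below shows tIoU itself, the overlap between a predicted and an annotated segment. The metric details here are an assumption for illustration (the same overlap notion also underlies the HACS localization tasks below); please consult each task's evaluation kit for the official protocol.

    # Sketch of temporal IoU (tIoU) between a predicted and a ground-truth segment.
    def temporal_iou(pred, gt):
        """pred and gt are (start, end) tuples in seconds."""
        inter_start = max(pred[0], gt[0])
        inter_end = min(pred[1], gt[1])
        intersection = max(0.0, inter_end - inter_start)
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
        return intersection / union if union > 0 else 0.0

    # A prediction is usually counted as correct at threshold t if tIoU >= t
    # and the predicted action label matches the ground truth.
    print(temporal_iou((10.0, 25.0), (12.0, 30.0)))  # 13 / 20 = 0.65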
HACS[Fully-supervised]: In this task, participants will use HACS Segments, a video dataset carefully annotated with a complete set of temporal action segments for the temporal action localization task. Each video can contain multiple action segments. The task is to localize these action segments by predicting the start and end times of each action as well as the action label.
HACS[Weakly-supervised]: In this task, participants will use the HACS Segments dataset, but are NOT allowed to use the annotations of the start and end times of action segments. They can still use the class labels of all action segments in the video. Participants are encouraged to explore weakly-supervised training procedures to learn action localization models.
SoccerNet[Action Spotting]: This task leverages the SoccerNet-v2 dataset, which contains over 500 games covering three seasons of the six major European football leagues. Given a professional soccer broadcast, participants are asked to spot the exact timestamps in the video at which various actions occur.
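As a rough illustration of what "spotting" means, the snippet below checks whether a predicted timestamp falls within a tolerance window around a ground-truth anchor of the same class. The window size and any averaging over tolerances are assumptions here; see the official SoccerNet evaluation kit for the exact metric.

    # Sketch of a spotting criterion: a prediction is a hit if it lands within
    # a tolerance window (in seconds) around the annotated anchor time.
    def is_spotted(pred_time, gt_time, tolerance=5.0):
        """Times in seconds; tolerance is the half-width of the acceptance window."""
        return abs(pred_time - gt_time) <= tolerance

    print(is_spotted(123.0, 125.5))  # True  (within 5 s)
    print(is_spotted(123.0, 140.0))  # False (off by 17 s)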
Spatio-Temporal Localization
AVA-Kinetics: The AVA-Kinetics task is an umbrella for a crossover of the previous AVA and Kinetics tasks, where Kinetics has now been annotated with AVA labels (but AVA has not been annotated with Kinetics labels). There has always been some interaction between the two datasets; for example, many AVA methods are pre-trained on Kinetics. The new annotations should allow for improved performance on both tasks and also increase the diversity of the AVA evaluation set (which now also includes Kinetics clips).
AVA-ActiveSpeaker: The goal of this task is to evaluate whether algorithms can determine when a visible face is speaking. For this task, participants will use the new AVA-ActiveSpeaker dataset. The purpose of this dataset is to extend the AVA Actions dataset to the task of active speaker detection, and to push the state-of-the-art in multimodal perception: participants are encouraged to use both the audio and video data.
ActEV SDL Unknown Facility (UF): This task seeks to encourage the development of robust automatic activity detection algorithms for an extended video. Challenge participants will develop algorithms to detect and temporally localize instances of Known Activities using an ActEV Command Line Interface (CLI) submission on the Unknown Facility EO video dataset.
ActivityNet Event Dense-Captioning: Most natural videos contain numerous events. For example, in a video of a 'man playing a piano', the video might also contain another 'man dancing' or 'a crowd clapping'. This challenge studies the task of dense-captioning events, which involves both detecting and describing events in a video. This challenge uses the ActivityNet Captions dataset, a large-scale benchmark for dense-captioning events.
ActivityNet Entities Object Localization: This task aims to evaluate how grounded or faithful a description (either generated or ground-truth) is to the video it describes. An object word is first identified in the description and then localized in the video in the form of a spatial bounding box. The prediction is compared against the human annotation to determine its correctness and the overall localization accuracy.
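The usual overlap measure behind "correct localization" of a bounding box is spatial IoU between the predicted and annotated boxes, typically thresholded (e.g., at 0.5). The sketch below shows that quantity only; the threshold and the exact official criterion are assumptions, so please refer to the task's evaluation kit.

    # Sketch of spatial IoU between a predicted and an annotated box.
    def box_iou(pred, gt):
        """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
        ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
        ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
        area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
        union = area_pred + area_gt - inter
        return inter / union if union > 0 else 0.0

    print(box_iou((0, 0, 100, 100), (50, 0, 150, 100)))  # 0.333...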
Video Semantic Role Labeling (VidSitu dataset): This challenge evaluates the ability of vision algorithms to understand complex related events in a video. Each event may be described by a verb corresponding to the most salient action in a video segment and its semantic roles. VidSRL involves 3 sub-tasks: (1) predicting a verb-sense describing the most salient action; (2) predicting the semantic roles for a given verb; and (3) predicting event relations given the verbs and semantic roles for two events.
SoccerNet[Replay Grounding]: Given a professional broadcast of soccer and replay sequences within that broadcast, spot the exact timestamps in the video at which the actions replayed in the sequences occur.
MMAct: This challenge asks participants to propose cross-modal video action understanding approaches that address the shortcomings of visual-only approaches by leveraging both sensor- and vision-based modalities, in ways that overcome the limitations imposed by the modality discrepancy between the train (sensor + video) and test (video only) phases. Two sub-tasks are provided: (1) Action Recognition; (2) Action Temporal Localization.
HOMAGE: We are releasing a new dataset: Home Action Genome (HOMAGE). This task evaluates compositional activity recognition in the home using multiple views and multiple sensor modalities. HOMAGE involves 3 tracks for the challenge: (1) Atomic Action Localization, (2) Scene-graph Generation, and (3) Privacy Concerned Activity Recognition.