Video classification is the task of accurately assigning classification labels to multiple consecutive video frames, ranging from a short span of frames to an entire video. While image classification is designed to classify individual frames, video classification faces the more computationally expensive challenge of incorporating a temporal understanding across multiple frames. As such, the task requires not only classifying the objects, actions, or scenes within each frame, but also understanding the overarching theme or content of the video as a whole. Typical deep learning techniques for video classification span convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention mechanisms to capture both frame-level details and the overall narrative or context of the video.
MoViNet, introduced in "MoViNets: Mobile Video Networks for Efficient Video Recognition" (Kondratyuk et al., 2021), is a convolutional neural network architecture developed by Google researchers with a focus on efficient video recognition, particularly suited to mobile devices. Its design prioritizes computational efficiency while maintaining high accuracy, making it ideal for tasks like real-time video analysis on smartphones and tablets.
To get started, log in or create an account with Datature Nexus. In your workspace, you can create a project with the project type of Classification and specify your data type as Videos or Images and Videos.
To fine-tune a MoViNet model, you can create a workflow that will fine-tune a model with your annotated data. With Datature, you can choose to train a MoViNet model with pre-trained weights from the Kinetics-600 dataset (Carreira et al., 2018) and continue from a trained artifact of the same model type on Nexus. Datature offers architectures from A0 to A5 with resolutions from 172x172 to 320x320.
MoViNet also exposes several hyperparameters for tuning: batch size, frame size (the number of frames processed at once), frame stride, discard threshold, and training steps.
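As a rough illustration, the hyperparameters above could be collected into a configuration like the following. This is a Python sketch with invented key names; it does not reflect Nexus's actual configuration schema.

```python
# Illustrative configuration only - key names are hypothetical,
# mirroring the hyperparameters listed above.
movinet_config = {
    "architecture": "MoViNet-A2",  # A0-A5 are available on Nexus
    "batch_size": 8,
    "frame_size": 32,              # number of frames processed at once
    "frame_stride": 2,             # sample every 2nd raw frame
    "discard_threshold": 0.5,      # threshold for discarding clips (platform-specific)
    "training_steps": 5000,
}

# A clip of frame_size frames sampled at frame_stride spans this many raw frames:
coverage = movinet_config["frame_size"] * movinet_config["frame_stride"]
print(coverage)  # 64
```

Frame size and frame stride interact: a larger stride widens the temporal window each clip covers at the cost of skipping frames.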
To monitor your training and model performance, you can view and analyze the metrics curves in real-time on the Trainings page, as well as visualize predictions through our Advanced Evaluation and Confusion Matrix tools.
With your trained artifacts, you can quickly deploy your model on the cloud with both CPU and GPU options. MoViNet is deployed for streaming video, utilizing stream buffers to keep inference computationally efficient and to ensure that, as clips are passed through, temporal context is retained for improved accuracy.
You can easily try this out on your own video data by following the steps above with your own Datature Nexus account. With our Free tier, you can perform all of the steps without a credit card or payment, within the limits of the account quota.
You can always check whether image classification or video classification is better suited to your context by training an image classification model for comparison; for simpler use cases, image classification can benefit deployment with faster inference. To learn more about training image classification, you can read this article.
Alongside video classification, we will be developing support for more video and 3D data inputs. Users can look out for model training support for 3D medical models and other video-related models for action recognition and action classification tasks. As always, user feedback is welcome, and if there are any particular models in this 3D or temporal space that you feel should be on the platform, please feel free to reach out and let us know!
MoViNets are a family of convolutional neural networks which efficiently process video streams, outputting accurate predictions with a fraction of the latency of convolutional video classifiers like 3D ResNets or transformer-based classifiers like ViT.
Frame-based classifiers output predictions on each 2D frame independently, resulting in sub-optimal performance due to their lack of temporal reasoning. On the other hand, 3D video classifiers offer high accuracy predictions by processing all frames in a video clip simultaneously, at a cost of significant memory and latency penalties as the number of input frames increases. MoViNets offer key advantages from both 2D frame-based classifiers and 3D video classifiers while mitigating their disadvantages.
The following figure shows a typical approach to using 3D networks with multi-clip evaluation, where the predictions of multiple overlapping subclips are averaged together. Shorter subclips result in lower latency, but reduce the overall accuracy.
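A minimal sketch of this multi-clip scheme, using NumPy and a toy stand-in scorer rather than a real 3D CNN, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, num_classes = 64, 600
video = rng.normal(size=(num_frames, num_classes))  # stand-in per-frame features

def classify_subclip(frames):
    # Toy scorer returning a probability distribution for one subclip.
    # In a real pipeline each subclip would be run through a 3D CNN.
    logits = frames.mean(axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Multi-clip evaluation: overlapping subclips, predictions averaged together.
clip_len, stride = 16, 8
preds = [classify_subclip(video[s:s + clip_len])
         for s in range(0, num_frames - clip_len + 1, stride)]
avg_pred = np.mean(preds, axis=0)
print(len(preds), int(avg_pred.argmax()))
```

With a 64-frame video, 16-frame subclips, and a stride of 8, seven overlapping subclips are scored and averaged; shortening `clip_len` lowers per-clip latency but gives each prediction less temporal context.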
MoViNets take a hybrid approach, using causal convolutions in place of standard 3D convolutions so that intermediate activations can be cached across frames with a Stream Buffer. The Stream Buffer stores the input activations of all 3D operations; the model outputs the buffer alongside its predictions, and the buffer is fed back into the model with the next clip.
The result is that MoViNets can receive one frame input at a time, reducing peak memory usage while resulting in no loss of accuracy, with predictions equivalent to inputting all frames at once like a 3D video classifier. MoViNets additionally leverage Neural Architecture Search (NAS) by searching for efficient configurations of models on video datasets (specifically Kinetics 600) across network width, depth, and resolution.
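The stream-buffer idea can be illustrated with a toy 1D causal convolution in NumPy: the buffer carries the last k-1 activations between calls, so processing frames one at a time reproduces the full-clip output exactly. This is an illustrative sketch, not MoViNet's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k = 20, 3                      # number of frames, temporal kernel size
x = rng.normal(size=T)            # stand-in 1-channel activations over time
w = rng.normal(size=k)            # temporal (causal) convolution kernel

# Full-clip causal convolution: pad k-1 zeros on the left so the output at
# time t depends only on frames <= t.
padded = np.concatenate([np.zeros(k - 1), x])
full = np.array([padded[t:t + k] @ w for t in range(T)])

# Streaming equivalent: a stream buffer caches the last k-1 activations and
# is carried across calls, so frames can arrive one at a time.
buffer = np.zeros(k - 1)
stream = []
for frame in x:
    window = np.concatenate([buffer, [frame]])
    stream.append(window @ w)
    buffer = window[1:]           # keep the most recent k-1 activations
stream = np.array(stream)

print(np.allclose(full, stream))  # True
```

Because the two computations are identical term by term, peak memory in the streaming case depends only on the buffer size (k-1 activations), not on the clip length.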
The result is a set of action classifiers that can output temporally-stable predictions that smoothly transition based on frame content. Below is an example plot of MoViNet-A2 making predictions on each frame on a video clip of skateboarding. Notice how the initial scene with a small amount of motion has relatively constant predictions, while the next scene with much larger motion causes a dramatic shift in predicted classes.
MoViNets need a few modifications to run effectively on edge devices. We start with MoViNet-A0-Stream, MoViNet-A1-Stream, and MoViNet-A2-Stream, the smaller models that can feasibly run in real time (20 fps or higher). To quantize MoViNet effectively, we make a few modifications to the model architecture: the hard swish activation is replaced by ReLU6, and the Squeeze-and-Excitation layers present in the original architectures are removed, which costs 3-4 p.p. of accuracy on Kinetics-600. We then convert the models to TensorFlow Lite and use integer-based post-training quantization (as well as float16 quantization) to reduce model size and speed up inference on mobile CPUs. Integer-based post-training quantization introduces a further 2-3 p.p. accuracy loss. Compared to the original MoViNets, the quantized models lag behind in accuracy on full 10-second Kinetics 600 clips (5-7 p.p. accuracy reduction in total), but in practice they provide very accurate predictions on everyday human actions, e.g., push ups, dancing, and playing piano. In the future, we plan to use quantization-aware training to bridge this accuracy gap.
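For reference, the two activations involved in the swap above can be written out directly. The definitions are standard; this small Python sketch is illustrative and is not the TensorFlow Lite implementation.

```python
# relu6 clips activations to [0, 6]; hard_swish(x) = x * relu6(x + 3) / 6.
def relu6(x):
    return min(max(x, 0.0), 6.0)

def hard_swish(x):
    return x * relu6(x + 3.0) / 6.0

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, relu6(x), hard_swish(x))
```

For large positive inputs both activations pass the input through nearly unchanged, but ReLU6's piecewise-linear shape maps cleanly onto integer arithmetic, which is why it quantizes with less error than the curved hard swish.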
We benchmark quantized A0, A1, and A2 on real hardware: the models achieve inference speeds of 200, 120, and 60 fps respectively on a Pixel 4 CPU. In practice, due to input pipeline overhead, we see throughput closer to 20-60 fps when running on Android with a camera as input.
You can train your own video classifier model using the MoViNet codebase in the TensorFlow Model Garden. The provided Colab notebook provides specific steps on how to fine-tune a pretrained video classifier on another dataset.
We are excited to see on-device online video action recognition powered by MoViNets, which demonstrate highly efficient performance. In the future, we plan to support quantization-aware training for MoViNets to mitigate the quantization accuracy loss. We are also interested in extending MoViNets as the backbone for more on-device video tasks, e.g. video object detection, video object segmentation, visual tracking, pose estimation, and more.
We would like to extend a big thanks to Yeqing Li for supporting MoViNets in TensorFlow Model Garden, Boqing Gong, Huisheng Wang, and Ting Liu for project guidance, Lu Wang for code reviews, and the TensorFlow Hub team for hosting our models.
In machine learning, video classification is the task of taking video frames as input and predicting a single class from a larger set of classes as output. This makes it important for a video action recognition model to consider the content of each frame, as well as the relationships between adjacent frames and the actions unfolding across the video.
3D convolutional neural networks extend 2D CNNs to sequences of images, learning spatiotemporal information from videos. While they can learn correlations between temporal changes in adjacent frames without additional temporal learning methods, 3D CNNs have a major inherent disadvantage: high computational complexity and excessive memory usage. Furthermore, they do not support online inference, making them difficult to run on mobile devices. Even the recent X3D networks, despite their improved efficiency, fall short in one way or another: they either require extensive memory resources on large temporal windows, which incurs high cost, or use small temporal windows, which reduces accuracy. Hence, there is a large gap in video action recognition between accurate models and efficient models. 2D MobileNet CNNs are fast and can operate on streaming video in real time, but are prone to noisy and inaccurate predictions.
MoViNets are trained on the Kinetics-600 dataset, a large-scale, high-quality collection of URL links to 650,000 video clips. The Kinetics family of datasets consists of human-annotated clips covering 400/600/700 human action classes, including human-object and human-human interactions. Through this training, MoViNets can identify 600 human actions, such as playing the trumpet, robot dancing, or bowling, and can classify video streams captured on a modern smartphone in real time.