Today we are announcing the release of a new approach to human body pose perception, BlazePose, which we presented at the CV4ARVR workshop at CVPR 2020. Our approach provides human pose tracking by employing machine learning (ML) to infer 33 2D landmarks of a body from a single frame. In contrast to current pose models based on the standard COCO topology, BlazePose accurately localizes more keypoints, making it uniquely suited for fitness applications. In addition, current state-of-the-art approaches rely primarily on powerful desktop environments for inference, whereas our method achieves real-time performance on mobile phones with CPU inference. With GPU inference, BlazePose achieves super-real-time performance, enabling it to run subsequent ML models, such as face or hand tracking.
The current standard for human body pose is the COCO topology, which consists of 17 landmarks across the torso, arms, legs, and face. However, the COCO keypoints extend only as far as the ankle and wrist, lacking the scale and orientation information for the hands and feet that is vital for practical applications like fitness and dance. The inclusion of more keypoints is also crucial for the subsequent application of domain-specific pose estimation models, like those for hands, face, or feet.
With BlazePose, we present a new topology of 33 human body keypoints, which is a superset of the COCO, BlazeFace, and BlazePalm topologies. This allows us to determine body semantics from pose prediction alone, consistent with the face and hand models.
For real-time performance of the full ML pipeline consisting of pose detection and tracking models, each component must be very fast, using only a few milliseconds per frame. To accomplish this, we observe that the strongest signal to the neural network about the position of the torso is the person's face (due to its high-contrast features and comparably small variations in appearance). Therefore, we achieve a fast and lightweight pose detector by making the strong (yet for many mobile and web applications valid) assumption that the head should be visible for our single-person use case.
The pose estimation component of the pipeline predicts the location of all 33 person keypoints with three degrees of freedom each (x and y location plus visibility), along with two virtual alignment keypoints. Unlike current approaches that employ compute-intensive heatmap prediction, our model uses a regression approach that is supervised by a combined heatmap/offset prediction of all keypoints.
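As a rough illustration of this output format, the sketch below unpacks a flat vector of 33 keypoints with three values each into per-keypoint records. The flat `[x, y, visibility, ...]` layout and the dict shape are assumptions for illustration only, not the model's actual tensor layout (the two virtual alignment keypoints are omitted here):

```python
# Hypothetical sketch: unpack a flat pose-output vector into per-keypoint
# records of (x, y, visibility). The [x0, y0, v0, x1, y1, v1, ...] layout
# is an assumption for illustration; consult the model card for the real one.
NUM_KEYPOINTS = 33
VALUES_PER_KEYPOINT = 3  # x, y, visibility

def unpack_keypoints(flat_output):
    assert len(flat_output) == NUM_KEYPOINTS * VALUES_PER_KEYPOINT
    keypoints = []
    for i in range(NUM_KEYPOINTS):
        x, y, vis = flat_output[i * 3 : i * 3 + 3]
        keypoints.append({"x": x, "y": y, "visibility": vis})
    return keypoints

# Dummy output: every keypoint at (0.5, 0.5) with visibility 0.9.
dummy = [0.5, 0.5, 0.9] * NUM_KEYPOINTS
kps = unpack_keypoints(dummy)
```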
To cover a wide range of customer hardware, we present two pose tracking models: lite and full, which are differentiated in the balance of speed versus quality. For performance evaluation on CPU, we use XNNPACK; for mobile GPUs, we use the TFLite GPU backend.
I am currently playing around with a custom BlazePose model (here is the repo and code). I am having trouble visualizing the predictions. I was checking the source code and found that the model returns 3 outputs:
The aim is to develop a computer-based assessment model for novel dynamic postural evaluation using RULA. The present study proposed a camera-based, three-dimensional (3D) dynamic human pose estimation model using 'BlazePose' with a data set of 50,000 action-level-based images. The model was investigated using the Deep Neural Network (DNN) and Transfer Learning (TL) approach. The model has been trained to evaluate posture with high accuracy, precision, and recall for each output prediction class. The model can quickly analyse the ergonomics of dynamic posture online and offline with a promising accuracy of 94.12%. A novel dynamic postural estimator using BlazePose and transfer learning is proposed and assessed for accuracy. The model assumes a constant muscle loading factor and foot support score, and can evaluate one person at a time given good image clarity. Practitioner summary: A detailed investigation of dynamic work postures is largely missing in the literature. Experimental analysis has been performed using transfer learning, BlazePose, and RULA action levels. An overall accuracy of 94.12% is achieved for dynamic postural assessment.
Its edge is that it is better suited than existing models to applications like fitness, rehabilitation, and dance. Why? It is more accurate, and it localizes more keypoints than previous models, which matters for applications where the scale and orientation of the hands and feet are vital information.
The returned poses list contains detected poses for each individual in the image. For single-person models, there will only be one element in the list. Currently, only PoseNet supports multi-pose estimation. If the model cannot detect any poses, the list will be empty.
The score ranges from 0 to 1. It represents the model's confidence of a keypoint. Usually, keypoints with low confidence scores should not be used. Each application may require a custom confidence threshold. For applications that require high precision, we recommend a larger confidence value. Conversely, applications that require high recall may choose to lower the threshold. The confidence values are not calibrated between models, and therefore setting a proper confidence threshold may involve some experimentation.
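The thresholding described above amounts to a simple filter. The sketch below shows one way it could look; the keypoint dict shape and field names are assumptions for illustration (TF.js pose models return objects of roughly this form):

```python
# Minimal sketch of score-based keypoint filtering. The keypoint dicts
# with "name" and "score" fields are an assumed format for illustration.
def filter_keypoints(keypoints, threshold=0.3):
    """Keep only keypoints whose confidence meets the threshold."""
    return [kp for kp in keypoints if kp["score"] >= threshold]

pose = [
    {"name": "left_wrist", "score": 0.92},
    {"name": "right_wrist", "score": 0.15},  # low confidence: likely occluded
    {"name": "nose", "score": 0.99},
]

# High-precision application: raise the threshold (drops right_wrist).
precise = filter_keypoints(pose, threshold=0.5)
# High-recall application: lower it (keeps everything).
permissive = filter_keypoints(pose, threshold=0.1)
```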
If anyone has luck with finding alternatives to wrnch or has recommendations for platforms/models that are able to handle tracking multiple poses/users on a 2D image at a high frame rate, I would love to look into some other options. Thanks!
Pose estimation allows us to detect athletic movements such as yoga, weight lifting, and squats. Pose estimation models let us track joint positions such as the shoulders, elbows, and hips in real time. Fitness routines prescribed by therapists can then be built digitally.
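Tracking an exercise usually comes down to measuring joint angles from the tracked keypoints. As a hedged sketch (the coordinates and keypoint names are illustrative, not from any particular model), the knee angle during a squat can be computed from three 2D keypoints:

```python
import math

# Sketch: the angle at a joint (e.g. the knee during a squat) from three
# 2D keypoints. Coordinates are illustrative normalized image positions.
def joint_angle(a, b, c):
    """Angle at point b (degrees) formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_theta = max(-1.0, min(1.0, dot / norm))  # clamp for float safety
    return math.degrees(math.acos(cos_theta))

# Hypothetical hip/knee/ankle positions forming a right angle:
hip, knee, ankle = (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)
angle = joint_angle(hip, knee, ankle)
```

A rep counter could then watch this angle cross a "deep squat" threshold and return above a "standing" threshold.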
The Google AI TensorFlow team has introduced several pose estimation models in the past couple of years, with a variety of model architectures: PoseNet, MoveNet, and BlazePose. Each of these models comes in several architecture variants.
The BlazePose model is offered by MediaPipe and infers 33 key points of a human body (in 2D-space and 3D-space versions), whereas PoseNet and MoveNet infer 17 key points in 2D space. MoveNet is considered the new-generation version of PoseNet. All three models are available on TensorFlow Hub in several runtime formats, such as TensorFlow JavaScript (TF.js), TFLite, and Coral (Edge TPU).
TFLite model format allows us to deploy ML models in mobile and IoT devices and run on-device inference. TensorFlow Lite is a lightweight version (around 1MB binary vs >1GB for a full Tensorflow install) of TensorFlow designed for mobile and embedded devices. TensorFlow Lite models are 5-10x compressed versions of full TF models. TFLite models usually measure in 10s of MB vs 100s of MB for the original models. The compression is done via techniques such as quantization, which converts 32-bit parameter data into 8-bit representations.
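The 32-bit-to-8-bit conversion mentioned above is typically an affine mapping through a scale and zero point. The sketch below shows the arithmetic in plain Python; the scale and zero-point values are illustrative, not taken from any particular model:

```python
# Sketch of affine quantization: a float32 value x is stored as the int8
# value round(x / scale) + zero_point, and recovered approximately on
# dequantization. Scale and zero_point here are illustrative.
def quantize(x, scale, zero_point):
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

scale, zero_point = 0.05, 0
q = quantize(1.0, scale, zero_point)
x = dequantize(q, scale, zero_point)
# The round trip loses at most scale / 2 per value (rounding error):
err = abs(dequantize(quantize(1.012, scale, zero_point), scale, zero_point) - 1.012)
```

This is why quantization roughly quarters the parameter storage (8 bits instead of 32) at the cost of a small, bounded precision loss per weight.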
Until recently it was not possible to train a model directly with TensorFlow Lite. We had to first train a model with TensorFlow/Keras, then save the trained model, convert it to a TFLite model using TensorFlow Lite converter and then deploy it on an edge device.
In 2020 the TensorFlow team introduced TensorFlow Lite Model Maker, which enables us to train certain types of models on custom datasets for on-device deployment. It uses a transfer learning approach to reduce the required amount of training data and shorten the training time.
Tensorflow Lite enables us to deploy models on devices with CPU only as well as devices with support for Edge TPU. Edge TPU provides a specific set of neural network operations and architectures and it is capable of executing deep neural networks about 10x faster than a CPU. It supports only TensorFlow Lite models that are fully 8-bit quantized and then compiled specifically for the Edge TPU.
PoseNet is an older-generation pose estimation model released in 2017. It is trained on the standard COCO dataset and provides single-pose and multi-pose estimation variants. The single-pose variant can detect only one person in an image or video, while the multi-pose variant can detect multiple people. Both variants have their own set of parameters and methodology. Single-pose estimation is simpler and faster, but it requires a single person in the image or video; otherwise, key points from multiple people will likely be estimated as belonging to a single subject.
MoveNet is the latest generation pose estimation model released in 2021. MoveNet is an ultra-fast and accurate model that detects 17 key points of a body. MoveNet has two variants known as Lightning and Thunder. Lightning is meant for latency-critical applications, while Thunder is meant for applications that require high accuracy. Both variants support 30+ FPS on most modern desktops, laptops, and phones. MoveNet outperforms PoseNet on a variety of datasets.
One of the promising use cases of pose estimation is detecting human falls. By analyzing frame sequences with pose estimation, we can recognize fall motions. A simple and effective heuristic for detecting a fall is to compare the angle of a person's spinal vector in the frames before and after a suspected fall.
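The heuristic above can be sketched as follows. The spine is approximated by the vector from mid-hip to mid-shoulder, and its angle from vertical is compared across two frames; the 60-degree threshold and the example coordinates are assumptions for illustration:

```python
import math

# Fall heuristic sketch: compare the spine vector's angle from vertical
# before and after. The 60-degree threshold is an illustrative assumption.
def spine_angle_from_vertical(mid_hip, mid_shoulder):
    dx = mid_shoulder[0] - mid_hip[0]
    dy = mid_shoulder[1] - mid_hip[1]  # image y grows downward
    # Angle (degrees) between the spine vector and the upward direction.
    return math.degrees(math.atan2(abs(dx), -dy))

def looks_like_fall(before, after, threshold_deg=60.0):
    """True if the spine went from roughly upright to roughly horizontal."""
    return (spine_angle_from_vertical(*before) < threshold_deg
            and spine_angle_from_vertical(*after) >= threshold_deg)

# Hypothetical (mid_hip, mid_shoulder) pairs in normalized image coords:
standing = ((0.5, 0.8), (0.5, 0.4))   # spine roughly vertical
fallen = ((0.3, 0.8), (0.8, 0.75))    # spine roughly horizontal
```

A production system would also smooth over several frames and check keypoint confidence before triggering an alert.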
The base pose detection models perform well in low lighting, although they do have limits. We have seen examples where the ML models did well even when it was hard for a human eye to distinguish objects in a dimly lit room.
People personalize their spaces in a variety of ways. Some are minimalists in their choice of furniture and wall colors, while others are happier with a rainbow of colors and objects around them. The latter presents a notable challenge to computer vision models that have not seen such unique home decor.
For the time being, we advise users to keep high-fall-risk areas clear of clutter, but we are working on introducing a feedback loop that would allow users to help their local models learn about their personal space and reduce mistakes.