Robust Real-Time Pose Estimation with OpenPose

OpenPose represents the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints on single images.

Deevid De Meyer - 21/2/2022

As a developer, every once in a while you might come across an application that really makes your heart sing, that makes you wonder: “Where has this been all my life?”. For me, as someone who frequently works on Computer Vision applications, OpenPose is one of those applications.

Now, all hyperbole aside, you might already be wondering what exactly OpenPose does. Well, keeping the idiom “A GIF is worth a thousand pictures” in mind, let’s have a look:

Real-time demo of OpenPose pose estimation, including hand and face markers

Now, depending on your background, either your jaw is hanging on the floor in amazement (like mine), or you’re wondering why this is so special, considering a Microsoft Kinect provided pretty much the same functionality seven years ago. To the average person it would seem that OpenPose and the Kinect are doing the same thing, and in a sense that’s true: both deliver the same net result, real-time pose estimation of human bodies. The main difference is the hardware: while the Kinect uses a special 3D camera, OpenPose works on any old webcam that you might have lying around (it does require a pretty beefy desktop computer, though).

But why is this such a big deal? Intuitively, this might be hard to grasp. After all, we humans are just as good at judging the orientation of a human body with our special 3D cameras (two eyes) as we are with our regular old 2D view (one eye closed). So let’s take some time to explain the essential difference between the two systems.

OpenPose vs Kinect

To understand why OpenPose is that much more impressive, we first need to understand how a pose estimation algorithm would work on a Microsoft Kinect. I’ve found a great image explaining a possible algorithm in a 2011 publication:

Algorithm overview from “Accurate 3D Pose Estimation From a Single Depth Image” by Mao Ye, Xianwang Wang, Ruigang Yang, Liu Ren and Marc Pollefeys

In this algorithm, the first two steps are very important, and it is here that the Kinect hardware is crucial. You see, a Kinect doesn’t simply return an image like a normal camera would. The “image” is actually more like a three-dimensional point cloud. This means that we can easily separate a moving human from the background by simply isolating the points at a certain distance from the camera (this is the “Depth Thresholding” step of the algorithm). Because this data is quite noisy, we then need to do some denoising to arrive at a reasonable 3D model of a human body.
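To make the “depth thresholding plus denoising” idea a bit more concrete, here is a minimal sketch of how you might isolate a person from a Kinect-style depth image. Note that this is an illustration of the idea, not the actual pipeline from the paper; the threshold values and the morphological clean-up are my own assumptions.

```python
import numpy as np
import cv2

def isolate_foreground(depth_mm, near=500, far=2500):
    """Keep only pixels whose depth (in millimetres) falls inside a band.

    depth_mm: 2D array of depth readings, as a Kinect-style sensor would return.
    near/far: illustrative thresholds; real values depend on the setup.
    """
    # Depth thresholding: everything outside the band is treated as background.
    mask = ((depth_mm > near) & (depth_mm < far)).astype(np.uint8) * 255

    # Simple denoising: morphological opening removes isolated noisy pixels,
    # closing fills small holes in the human-shaped blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```

The key point is how cheap this is: a couple of comparisons on the depth values already hands you a human-shaped blob, which is exactly the luxury a plain webcam cannot offer.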

Now it must be noted that the succeeding steps are far from trivial, but the method is pretty similar to the one used by OpenPose. Most importantly, thanks to the Kinect hardware, you essentially get the first two steps for free.

These first two steps are essentially the biggest challenge tackled by OpenPose, and they boil down to an important research domain in Computer Vision: image segmentation.

Image Segmentation

Basically, using a Microsoft Kinect, you start out with a human-shaped blob, and the challenge is to identify the key joints in this blob. Connecting these joints then yields a wireframe skeleton that pretty much completely defines how a human body is oriented.
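As a toy illustration of what “connecting joints into a wireframe skeleton” looks like in code, here is a small sketch that draws a skeleton once the keypoints are known. The joint names and limb pairs below are simplified assumptions for readability; they are not OpenPose’s actual output format, which defines many more keypoints and connections.

```python
import cv2

# A few illustrative joint connections ("limbs"); a real body model such as
# OpenPose's BODY_25 defines far more pairs than shown here.
LIMBS = [
    ("neck", "r_shoulder"), ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
    ("neck", "l_shoulder"), ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
]

def draw_skeleton(image, keypoints, min_conf=0.1):
    """Draw a wireframe skeleton given a dict of joint name -> (x, y, confidence)."""
    for joint_a, joint_b in LIMBS:
        a, b = keypoints.get(joint_a), keypoints.get(joint_b)
        # Skip limbs for which either joint was not detected confidently enough.
        if a is None or b is None or a[2] < min_conf or b[2] < min_conf:
            continue
        cv2.line(image, (int(a[0]), int(a[1])), (int(b[0]), int(b[1])), (0, 255, 0), 2)
    for x, y, conf in keypoints.values():
        if conf >= min_conf:
            cv2.circle(image, (int(x), int(y)), 4, (0, 0, 255), -1)
    return image
```

In other words, once the joints are found, drawing the skeleton is the easy part; finding them in the first place is where the real work lies.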

The big difference is that, instead of a 3D human blob, OpenPose starts out with a completely flat image containing both the person(s) of interest and the background. The first part of the OpenPose algorithm therefore needs to separate the persons of interest from the background. This is exactly what image segmentation tries to achieve.

Example of an image segmentation application
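To make “separating the persons of interest from the background” a bit more tangible, here is a short sketch using an off-the-shelf semantic segmentation model from torchvision. This is a generic stand-in to illustrate the concept, not the network OpenPose itself uses, and depending on your torchvision version the weights argument may differ (older releases used pretrained=True instead).

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Load a pretrained semantic segmentation model (a stand-in for illustration,
# not OpenPose's own network). Older torchvision: pretrained=True.
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def person_mask(image_path):
    """Return a boolean mask marking pixels the model classifies as 'person'."""
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)       # shape: [1, 3, H, W]
    with torch.no_grad():
        output = model(batch)["out"][0]          # shape: [21, H, W], Pascal VOC classes
    classes = output.argmax(dim=0)
    return (classes == 15).numpy()               # class 15 is 'person' in Pascal VOC
```

The resulting mask plays roughly the same role as the depth-thresholded blob from the Kinect, except that here a neural network had to learn what a person looks like instead of the hardware handing it to us.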

After performing image segmentation, the rest is pretty similar to the Kinect process, with the non-trivial added difficulty of having to start out with a 2D rather than a 3D blob. Now, I won’t be going into the gritty details of how OpenPose works (I would suggest you take a look at the papers here and here for more detail) because it requires knowledge of a technique called convolutional neural networks (a form of deep learning, for the buzzword enthusiasts) and other computational wizardry, which is a bit out of scope for this blog.
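That said, if you just want to experiment, the OpenPose authors have released their trained Caffe models, and OpenCV’s DNN module can run them directly. The sketch below takes the simplest single-person route (reading only the keypoint heatmaps, ignoring the part affinity fields used for multi-person association); assume you have downloaded the COCO body model files yourself, and treat the file names here as placeholders.

```python
import cv2

# Publicly released OpenPose COCO model files; download them yourself and
# adjust these paths (the names here are placeholders for illustration).
PROTO_FILE = "pose_deploy_linevec.prototxt"
WEIGHTS_FILE = "pose_iter_440000.caffemodel"
NUM_BODY_PARTS = 18  # COCO body model

net = cv2.dnn.readNetFromCaffe(PROTO_FILE, WEIGHTS_FILE)

def detect_keypoints(frame, threshold=0.1):
    """Run the network on one frame; return one (x, y) per body part, or None."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1.0 / 255, (368, 368), (0, 0, 0),
                                 swapRB=False, crop=False)
    net.setInput(blob)
    heatmaps = net.forward()  # shape: [1, parts + PAF channels, H', W']
    points = []
    for part in range(NUM_BODY_PARTS):
        heatmap = heatmaps[0, part, :, :]
        _, confidence, _, (x, y) = cv2.minMaxLoc(heatmap)
        # Rescale the heatmap peak back to the original image resolution.
        if confidence > threshold:
            points.append((int(x * w / heatmap.shape[1]),
                           int(y * h / heatmap.shape[0])))
        else:
            points.append(None)
    return points
```

Feed this frames from cv2.VideoCapture(0) in a loop and you have a rudimentary webcam demo, though without a decent GPU it will be far from the real-time performance shown in the GIF above.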

Applications

Now some of you might still be wondering why this is such a big deal. Drawing some fancy lines on top of a human form might be cool, but what are the applications? Well I’m glad you asked, dear anonymous reader, because they are many.

First of all, you might have heard the popular saying that 93% of all communication is non-verbal. Even though this claim has been largely debunked, researchers still agree that body language is a crucial factor in communication. Therefore, any system aiming to analyze human interactions could benefit from an application that is able to mathematically define human posture.

A second group is medical applications. For example, pose analysis is an important part of a physiotherapist’s job. Using pose analysis, physiotherapists are able to analyze sitting posture to treat back pain, review athletes’ movements and determine possible movement impairments. Using OpenPose, it might become possible for people to perform this kind of diagnostics with nothing more than their smartphones.
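As a simple illustration of the physiotherapy angle: once a pose estimator gives you keypoints for, say, hip, knee and ankle, measuring the joint angle between them is basic geometry. The sketch below uses made-up pixel coordinates and is not part of OpenPose itself.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by points a-b-c (e.g. hip-knee-ankle).

    Each point is an (x, y) keypoint as produced by a pose estimator.
    """
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Hypothetical keypoints (pixel coordinates) for hip, knee and ankle:
print(joint_angle((320, 240), (330, 340), (335, 440)))  # ~177°, a nearly straight leg
```

Track such angles over time and you have the raw material for analyzing sitting posture, squat depth or gait, all from an ordinary video.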

But there are many other applications: pose estimation can be used to transfer human movement to a virtual 3D model or even allow natural remote control of a robot. In the future, OpenPose could even become the ultimate tool for human-computer interaction, allowing for a full range of gesture controls. The following Wikipedia article has a few more applications and some more context concerning the research into body pose estimation.

Conclusion

I initially intended to say that OpenPose is one of the most impressive computer vision applications I had seen in a while. However, while contemplating this statement I had to conclude that even though OpenPose is mightily impressive, the last few years have produced a true flood of advancements in the field of AI, and computer vision in particular. Thanks to techniques like convolutional neural networks and other neural network architectures we are entering a true golden age of AI applications, and my feeling is that it won’t be long until I’m talking about the next killer application.

But even though OpenPose isn’t a solitary leap in the field of computer vision, it still managed to get the entire office dancing in front of a webcam just to see the wireframe superimposed on their bodies (see the picture below, this was the only PG-rated frame). Needless to say, we also can’t wait to leverage the power of OpenPose in one of our projects. And hey, if this little blog inspired you to think of a killer use case for OpenPose, give us a call, we’d be more than happy to talk possibilities ;-)

Just a small sample of the OpenPose enthusiasts at the office
