Human Mesh Recovery and Motion Capture

Yuan-Peng Yu1, Scott John Easley1
1University of Southern California

Recover 3D human meshes from monocular images and capture motion from input video.


Abstract

Although text-to-image models such as Stable Diffusion have advanced rapidly, they still struggle to generate anatomically accurate images of humans. Generated images often contain the wrong number of hands or fingers, particularly when subjects are depicted holding items, when their hands are intertwined, or when the viewing angle is obstructed. To provide a correct human prior for text-to-image models, this paper presents a human mesh recovery and motion capture framework. The framework produces 3D human meshes from monocular images and videos, capturing precise poses of the human body, face, and hands. Our method captures body motion in real time, and the pose of the recovered full-body parametric model (SMPL-X) is closely aligned with the input. We demonstrate the effectiveness of our model on various challenging real-world datasets and apply it to multiple downstream tasks.
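Aligning the recovered mesh to the input image is typically done with a weak-perspective camera, as in HMR-style regressors: the network predicts a scale and a 2D translation so the projected joints overlay the person. The sketch below illustrates this projection; the function and parameter names are illustrative, not from our codebase.

```python
import numpy as np

def weak_perspective_project(joints_3d, scale, trans):
    """Project 3D joints onto the image plane with a weak-perspective
    camera: drop depth, then apply the predicted scale and 2D offset.
    """
    return scale * (joints_3d[:, :2] + trans)

# Toy example: three joints of a canonical skeleton.
joints = np.array([[0.0, 0.0, 2.0],
                   [0.1, -0.5, 2.0],
                   [-0.1, -0.5, 2.0]])
pts_2d = weak_perspective_project(joints, scale=100.0, trans=np.array([0.5, 0.5]))
print(pts_2d.shape)  # (3, 2)
```

In a full pipeline these projected joints are compared against detected 2D keypoints, and the reprojection error drives the alignment between the SMPL-X mesh and the image.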


Video (Target)

(Source: PyMAF-X: Towards Well-aligned Full-body Model Regression from Monocular Images)


Target Result 1

Reload page to sync videos if needed.

Input Video
SMPL-X Overlay
Input Video
SMPL-X Overlay

(Source: SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation)


Target Result 2

  • To enable Stable Diffusion to replicate a required human pose, our framework provides 2D pose estimations as input conditions that ControlNet can adopt to precisely control the placement of image subjects. Through motion capture, people can perform the desired pose and acquire accurate pose estimations of the human body, face, and hands.
  • The 3D human mesh and its depth map can also serve as ControlNet conditions, so Stable Diffusion can create images with the required human pose.
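A ControlNet pose condition is simply an image in which the estimated skeleton is rasterized. The sketch below shows the idea with minimal numpy line drawing; real pipelines use OpenPose-format color-coded limbs, and the keypoints and bone list here are a toy example.

```python
import numpy as np

def draw_skeleton(keypoints, bones, size=(256, 256)):
    """Rasterize 2D keypoints into a pose image usable as a ControlNet
    condition. Each bone is drawn as a white line on a black canvas.
    """
    canvas = np.zeros(size, dtype=np.uint8)
    for a, b in bones:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        # Sample enough points along the segment to leave no gaps.
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.linspace(x0, x1, n).round().astype(int)
        ys = np.linspace(y0, y1, n).round().astype(int)
        canvas[ys.clip(0, size[0] - 1), xs.clip(0, size[1] - 1)] = 255
    return canvas

# Toy pose: head, neck, hips connected by two bones.
kps = {0: (128, 40), 1: (128, 100), 2: (128, 180)}
cond = draw_skeleton(kps, bones=[(0, 1), (1, 2)])
print(cond.shape, cond.max())  # (256, 256) 255
```

The resulting array, saved as an image, is what a ControlNet pose model consumes alongside the text prompt; a depth-map condition is produced analogously by rendering the mesh's per-pixel depth instead of its skeleton.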
Stable Diffusion Image reference.

(Source: https://toyxyz.gumroad.com/l/ciojz)


Target Result 3

The generated 3D human mesh can be viewed from any angle over a full 360 degrees by varying the camera position.
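Orbiting the camera around the mesh is equivalent to rotating the mesh's vertices about the vertical axis. A minimal sketch of that rotation (a real viewer would move the camera and re-render instead):

```python
import numpy as np

def orbit_view(vertices, yaw_deg):
    """Rotate mesh vertices about the vertical (y) axis by yaw_deg
    degrees, simulating a camera orbit around the recovered body.
    """
    t = np.radians(yaw_deg)
    rot_y = np.array([[ np.cos(t), 0.0, np.sin(t)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(t), 0.0, np.cos(t)]])
    return vertices @ rot_y.T

verts = np.array([[1.0, 0.0, 0.0]])
# A full 360-degree orbit returns the mesh to its starting orientation.
back = orbit_view(verts, 360.0)
print(np.allclose(back, verts))  # True
```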

ECON Image reference.

(Source: ECON: Explicit Clothed humans Optimized via Normal integration)

VIBE Image reference.

(Source: VIBE: Video Inference for Human Body Pose and Shape Estimation)

The recovered 3D human model can also be retargeted to animate another character.
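Retargeting works because SMPL-family pose parameters are per-joint rotations, independent of body shape: the same thetas animate any rest skeleton. The sketch below applies one captured rotation sequence to two characters with different limb lengths; the two-joint chain and all names are illustrative.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector to a 3x3 rotation matrix
    (Rodrigues' formula), as used for SMPL pose parameters.
    """
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def retarget(pose_axis_angle, rest_offsets):
    """Pose a kinematic chain by accumulating per-joint rotations along
    the rest-pose bone offsets; returns the posed joint positions.
    """
    joints = [np.zeros(3)]
    R = np.eye(3)
    for rot, offset in zip(pose_axis_angle, rest_offsets):
        R = R @ rodrigues(rot)
        joints.append(joints[-1] + R @ offset)
    return np.array(joints)

# The same 90-degree bend applied to a short- and a long-limbed character.
pose = [np.array([0.0, 0.0, np.pi / 2]), np.zeros(3)]
short = retarget(pose, rest_offsets=[np.array([1.0, 0.0, 0.0])] * 2)
tall = retarget(pose, rest_offsets=[np.array([2.0, 0.0, 0.0])] * 2)
print(short[-1], tall[-1])
```

Both chains end up bent the same way; only the scale differs, which is exactly why a pose captured from one person transfers cleanly to another character.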

ROMP Image reference.

(Source: ROMP: Monocular, One-stage, Regression of Multiple 3D People)


Acknowledgements

Parts of the project code are adapted from the following excellent projects: PyMAF-X, ROMP, and ECON.


BibTeX

@article{yu2023hmr,
  author  = {Yu, Yuan-Peng and Easley, Scott John},
  title   = {Human Mesh Recovery and Motion Capture},
  journal = {},
  year    = {2023},
}