谭嘉崴, 丁其川, 白忠玉. 基于视频帧连贯信息的3维人体姿势优化估计方法[J]. 机器人, 2021, 43(1): 9-16. DOI: 10.13973/j.cnki.robot.200023.
TAN Jiawei, DING Qichuan, BAI Zhongyu. Optimal Estimation Method of 3-Dimensional Human Pose Based on Video Frame Coherent Information. ROBOT, 2021, 43(1): 9-16. DOI: 10.13973/j.cnki.robot.200023.
Abstract: Traditional video-based 3D human pose estimation first estimates the 3D pose in each image frame and then arranges the per-frame results in frame order to obtain the 3D pose for the video. This approach ignores both the continuity of human motion across consecutive frames and the spatial consistency of the joint connections, which leads to high-frequency jitter and large bias in the estimates. To address this problem, an optimal 3D pose estimation method based on the coherent information of video frames is presented. First, 2D pose estimates are used to optimize the 3D joint coordinates of the human body in order to reduce jitter. Second, forward and backward predictions of joint motion from the preceding and following frames are introduced to maintain the consistency of movement. Finally, bone connection constraints are added, yielding a model that keeps the human motion trajectory smooth and keeps the joint connection structure consistent before and after optimization, so that the 3D human pose is estimated accurately. Tests on the public dataset MPI-INF-3DHP show that the average joint error of the proposed method is 3.2% lower than that of the reference 3D pose estimation method. Tests on the public dataset 3DPW show that the acceleration error is 44% lower than in the unoptimized case.
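The two quantities reported in the abstract, average joint error and acceleration error, can be illustrated with a minimal sketch. It assumes the definitions commonly used in the 3D pose literature (mean Euclidean distance per joint, and a second-order finite difference for acceleration); the abstract does not state the paper's exact formulas, so they may differ. The helper names `mpjpe`, `accelerations`, `acceleration_error`, and `bone_lengths`, and the toy two-joint sequence, are hypothetical and serve illustration only.

```python
import math

# A pose sequence is a list of frames; each frame is a list of (x, y, z) joints.

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and true joints (assumed definition)."""
    dists = [math.dist(p, g)
             for pred_frame, gt_frame in zip(pred, gt)
             for p, g in zip(pred_frame, gt_frame)]
    return sum(dists) / len(dists)

def accelerations(seq):
    """Second-order finite difference x[t-1] - 2*x[t] + x[t+1] for every joint."""
    out = []
    for t in range(1, len(seq) - 1):
        out.append([tuple(a[i] - 2 * b[i] + c[i] for i in range(3))
                    for a, b, c in zip(seq[t - 1], seq[t], seq[t + 1])])
    return out

def acceleration_error(pred, gt):
    """Mean distance between predicted and true joint accelerations (assumed definition)."""
    return mpjpe(accelerations(pred), accelerations(gt))

def bone_lengths(frame, bones):
    """Length of each bone, given as (parent_index, child_index) pairs."""
    return [math.dist(frame[i], frame[j]) for i, j in bones]

# Toy data: two joints moving at constant velocity along x, joined by one bone.
gt = [[(t, 0, 0), (t, 2, 0)] for t in range(5)]
# Jittery estimate: joint 0 alternates +1/-1 in y from frame to frame.
pred = [[(t, (-1) ** t, 0), (t, 2, 0)] for t in range(5)]

print(mpjpe(pred, gt))               # position error caused by the jitter
print(acceleration_error(pred, gt))  # jitter is amplified in the acceleration error
print(bone_lengths(gt[0], [(0, 1)]), bone_lengths(pred[0], [(0, 1)]))
```

In this toy case the frame-to-frame jitter changes the bone length and produces an acceleration error several times larger than the position error, which is why an acceleration-based metric is a natural way to measure jitter in per-frame estimates.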