Abstract: To study how different residual connection methods affect convolutional neural networks (CNNs) for human motion prediction, this paper investigates how residual connections can be used to build an effective prediction model that captures human motion features with a network of a certain depth. Based on the observed arrangement of human skeletal joints, a symmetric residual connection method is proposed for predicting the skeletal joints, and a symmetric residual block (SRB) is designed on this basis. In the SRB, the receptive field of the last convolution kernel is maximized so that it covers the joint information of the whole body. The symmetric connections make efficient use of shallow dynamic features, which improves prediction performance and reduces the number of model parameters. An end-to-end convolutional network built from two SRBs and one decoder, named the symmetric residual network (SRNet), is proposed; it achieves higher accuracy than the baseline methods. Human motion prediction experiments are implemented in TensorFlow and carried out on two public datasets, Human3.6M and CMU-Mocap. The results indicate that the proposed method reduces the mean per joint position error (MPJPE) by 0.2 mm to 1 mm at each prediction time point compared with the baseline methods, which confirms the effectiveness of the proposed SRNet for modeling the global spatial features of the human body.
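For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how MPJPE is commonly computed for a single pose: the Euclidean distance between each predicted joint and its ground-truth position, averaged over all joints. The function name `mpjpe`, the three-joint toy pose, and its coordinate values are illustrative assumptions and are not taken from the paper or its datasets.

```python
import math


def mpjpe(predicted, target):
    """Mean per joint position error for one pose.

    predicted, target: sequences of (x, y, z) joint coordinates with the
    same number of joints, expressed in the same unit (millimetres here).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance for each joint, then the average over joints.
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Hypothetical three-joint pose, for illustration only.
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (100.0, 0.0, 0.0), (0.0, 200.0, 12.0)]

error_mm = mpjpe(predicted, target)  # per-joint errors are 5 mm, 0 mm and 12 mm
```

In a prediction benchmark this per-pose value would typically be averaged over all test sequences separately for each prediction time point, which is the granularity at which the abstract reports the 0.2 mm to 1 mm improvement.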