MO Xiuyun, CHEN Junhong, YANG Zhenguo, LIU Wenyin. A Robotic Command Generation Framework Based on Human Demonstration Videos. ROBOT, 2022, 44(2): 186-194,202. DOI: 10.13973/j.cnki.robot.200539.
Abstract: To improve a robot's skill-learning ability and avoid manual teaching, a sequence-to-sequence framework is proposed to automatically generate robotic commands from the observation of human demonstration videos without any special markers. Firstly, a Mask R-CNN (region-based convolutional neural network) is used to narrow down the manipulation area, and a two-stream I3D network (inflated 3D convolutional network) is adopted to extract optical flow features as well as RGB features from the videos. Secondly, a bidirectional LSTM (long short-term memory) network is introduced to acquire context information from the extracted features. Finally, self-attention and global attention mechanisms are integrated to learn the correlation between the sequence of video frames and the sequence of commands, and the sequence-to-sequence model ultimately outputs the robotic commands. Extensive experiments are conducted on the expanded MPII Cooking 2 dataset and the IIT-V2C dataset. Compared with existing methods, the proposed method achieves state-of-the-art performance on metrics such as BLEU-4 (0.705) and METEOR (0.462). The results show that the proposed method can learn manipulation tasks from human demonstration videos. In particular, the framework is successfully deployed on a Baxter robot.
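The following is a minimal PyTorch sketch of the kind of pipeline the abstract describes: per-clip video features (e.g. from a two-stream I3D backbone) are encoded with a bidirectional LSTM, re-weighted by self-attention, and decoded into command tokens with a global (Luong-style) attention decoder. The module names, feature dimensions, and vocabulary size are illustrative assumptions, not the authors' exact configuration.

# Sketch only: dimensions and wiring are assumptions for illustration.
import torch
import torch.nn as nn


class VideoToCommand(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=500, emb_dim=256):
        super().__init__()
        # Bidirectional LSTM over the sequence of per-clip features (RGB + flow, concatenated).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Single-head self-attention scores over the encoder states.
        self.self_attn = nn.Linear(2 * hidden, 1)
        # Decoder LSTM over command-token embeddings.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, batch_first=True)
        # Output projection over [decoder state; global-attention context].
        self.out = nn.Linear(4 * hidden, vocab_size)

    def forward(self, feats, commands):
        """feats: (B, T, feat_dim) video features; commands: (B, L) token ids (teacher forcing)."""
        enc, _ = self.encoder(feats)                              # (B, T, 2H)
        # Self-attention re-weights the encoder states before decoding.
        w = torch.softmax(self.self_attn(enc), dim=1)             # (B, T, 1)
        enc = enc * w
        # Initialise the decoder from the attention-pooled video summary.
        summary = enc.sum(dim=1, keepdim=True)                    # (B, 1, 2H)
        h0 = summary.transpose(0, 1).contiguous()                 # (1, B, 2H)
        c0 = torch.zeros_like(h0)
        dec, _ = self.decoder(self.embed(commands), (h0, c0))     # (B, L, 2H)
        # Global attention: align each decoder step with all encoder states.
        scores = torch.bmm(dec, enc.transpose(1, 2))              # (B, L, T)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc)   # (B, L, 2H)
        return self.out(torch.cat([dec, context], dim=-1))        # (B, L, vocab_size)


# Usage: next-token logits for a batch of 2 videos with 30 feature clips each.
model = VideoToCommand()
logits = model(torch.randn(2, 30, 2048), torch.randint(0, 500, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 500])

In this sketch the generated token sequence would be mapped to robot motions by a downstream controller (e.g. via ROS and MoveIt!, as in the paper's Baxter deployment); that mapping is outside the scope of the model itself.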