A Robotic Command Generation Framework Based on Human Demonstration Videos
MO Xiuyun, CHEN Junhong, YANG Zhenguo, LIU Wenyin
School of Computer, Guangdong University of Technology, Guangzhou 510006, China
Abstract To improve a robot's ability to learn skills and to avoid the manual teaching process, a sequence-to-sequence framework is proposed that automatically generates robotic commands from the observation of human demonstration videos without any special markers. First, a Mask R-CNN (region-based convolutional neural network) is used to narrow down the manipulation area, and a two-stream I3D network (inflated 3D convolutional network) is adopted to extract optical-flow features as well as RGB features from the videos. Second, a bidirectional LSTM (long short-term memory) network is introduced to capture context information from the extracted features. Finally, self-attention and global attention mechanisms are integrated to learn the correlation between the sequence of video frames and the sequence of commands, and the sequence-to-sequence model ultimately outputs the robotic commands. Experiments are conducted extensively on the expanded MPII Cooking 2 dataset and the IIT-V2C dataset. Compared with existing methods, the proposed method achieves state-of-the-art performance on metrics such as BLEU-4 (0.705) and METEOR (0.462). The results show that the proposed method can learn manipulation tasks from human demonstration videos; in particular, the framework is successfully applied to a Baxter robot.
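
To make the pipeline concrete, the following minimal sketch illustrates the encoder-decoder stage described in the abstract: pre-extracted two-stream I3D clip features are encoded by a bidirectional LSTM and decoded into command tokens with a global (dot-product) attention mechanism. This is not the authors' implementation; the class name VideoToCommand, all layer sizes, and the omission of the self-attention branch are assumptions made purely for illustration.

    import torch
    import torch.nn as nn

    class VideoToCommand(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, vocab_size=1000):
            super().__init__()
            # Bidirectional LSTM encoder over the sequence of clip-level features
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True,
                                   bidirectional=True)
            # Projection used for global (dot-product) attention over encoder states
            self.attn = nn.Linear(hidden, 2 * hidden)
            self.embed = nn.Embedding(vocab_size, hidden)
            self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, feats, commands):
            # feats:    (B, T, feat_dim) fused RGB + optical-flow I3D features
            # commands: (B, L) ground-truth command token ids (teacher forcing)
            enc, _ = self.encoder(feats)                              # (B, T, 2*hidden)
            B, L = commands.shape
            h = feats.new_zeros(B, self.decoder.hidden_size)
            c = feats.new_zeros(B, self.decoder.hidden_size)
            logits = []
            for t in range(L):
                emb = self.embed(commands[:, t])                      # (B, hidden)
                # Attention scores between the decoder state and every encoder state
                scores = torch.bmm(enc, self.attn(h).unsqueeze(2)).squeeze(2)
                weights = torch.softmax(scores, dim=1)                # (B, T)
                ctx = torch.bmm(weights.unsqueeze(1), enc).squeeze(1) # (B, 2*hidden)
                h, c = self.decoder(torch.cat([emb, ctx], dim=1), (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)                         # (B, L, vocab_size)

In this sketch, the model is trained with cross-entropy over the output logits and ground-truth command tokens; at inference time, decoding would proceed token by token, feeding each predicted command word back into the decoder.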
Received: 07 December 2020
Published: 22 March 2022