Abstract:
To improve a robot's skill-learning ability and avoid the manual teaching process, a sequence-to-sequence framework is proposed to automatically generate robotic commands from observations of human demonstration videos without any special markers. Firstly, a Mask R-CNN (region-based convolutional neural network) is used to narrow the manipulation area, and a two-stream I3D network (inflated 3D convolutional network) is adopted to extract optical-flow features as well as RGB features from the videos. Secondly, a bidirectional LSTM (long short-term memory) network is introduced to capture context information from the extracted features. Finally, self-attention and global attention mechanisms are integrated to learn the correlation between the sequence of video frames and the sequence of commands, and the sequence-to-sequence model outputs the robotic commands. Extensive experiments are conducted on the expanded MPII Cooking 2 dataset and the IIT-V2C dataset. Compared with existing methods, the proposed method achieves state-of-the-art performance on metrics such as BLEU_4 (0.705) and METEOR (0.462). The results show that the proposed method can learn manipulation tasks from human demonstration videos. In particular, the framework is successfully applied to a Baxter robot.
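The sketch below is a minimal, hypothetical PyTorch rendering of the seq2seq core summarized above (BiLSTM encoder over pre-extracted two-stream I3D features, self-attention refinement, and a global-attention decoder that emits command tokens). It is not the authors' implementation; the class and parameter names (Video2CommandSketch, feat_dim, hidden, vocab_size), the feature dimensionality, the Luong-style scoring, and the use of teacher forcing are all assumptions made for illustration.

```python
# Hypothetical sketch, not the paper's code: BiLSTM encoder over pre-extracted
# I3D clip features, self-attention over encoder states, and a global-attention
# decoder producing robotic command tokens with teacher forcing.
import torch
import torch.nn as nn

class Video2CommandSketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=1000):
        super().__init__()
        # Bidirectional LSTM encoder over per-clip video features.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Self-attention refinement of encoder states (single head for brevity).
        self.self_attn = nn.MultiheadAttention(2 * hidden, num_heads=1, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)   # Luong-style attention scores (assumed)
        self.out = nn.Linear(hidden + 2 * hidden, vocab_size)

    def forward(self, feats, commands):
        # feats: (B, T, feat_dim) pre-extracted I3D features; commands: (B, L) token ids.
        enc, _ = self.encoder(feats)                      # (B, T, 2*hidden)
        enc, _ = self.self_attn(enc, enc, enc)            # self-attention over video frames
        B, L = commands.shape
        h = feats.new_zeros(B, self.decoder.hidden_size)
        c = feats.new_zeros(B, self.decoder.hidden_size)
        ctx = enc.mean(dim=1)                             # initial context vector
        logits = []
        for t in range(L):
            emb = self.embed(commands[:, t])              # teacher forcing
            h, c = self.decoder(torch.cat([emb, ctx], dim=-1), (h, c))
            # Global attention: score each encoder state against the decoder state.
            scores = torch.bmm(enc, self.attn_proj(h).unsqueeze(-1)).squeeze(-1)   # (B, T)
            weights = torch.softmax(scores, dim=-1)
            ctx = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)                  # (B, 2*hidden)
            logits.append(self.out(torch.cat([h, ctx], dim=-1)))
        return torch.stack(logits, dim=1)                 # (B, L, vocab_size)
```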