MO Xiuyun, CHEN Junhong, YANG Zhenguo, LIU Wenyin. A Robotic Command Generation Framework Based on Human Demonstration Videos. ROBOT, 2022, 44(2): 186-194,202. DOI: 10.13973/j.cnki.robot.200539.
Abstract: To improve a robot's skill-learning ability and avoid manual teaching, a sequence-to-sequence framework is proposed to automatically generate robotic commands from the observation of human demonstration videos without any special markers. Firstly, a Mask R-CNN (region-based convolutional neural network) is used to narrow down the manipulation area, and a two-stream I3D network (inflated 3D convolutional network) is adopted to extract optical flow features as well as RGB features from the videos. Secondly, a bidirectional LSTM (long short-term memory) network is introduced to acquire context information from the extracted features. Finally, self-attention and global attention mechanisms are integrated to learn the correlation between the sequence of video frames and the sequence of commands, and the sequence-to-sequence model ultimately outputs the robotic commands. Extensive experiments are conducted on the expanded MPII Cooking 2 dataset and the IIT-V2C dataset. Compared with existing methods, the proposed method achieves state-of-the-art performance on metrics such as BLEU-4 (0.705) and METEOR (0.462). The results show that the proposed method can learn manipulation tasks from human demonstration videos. In particular, the framework is successfully deployed on a Baxter robot.
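The following is a minimal PyTorch sketch of the kind of pipeline the abstract describes: per-clip video features (e.g. from a two-stream I3D backbone) are encoded with a bidirectional LSTM, re-weighted by self-attention, and decoded into command tokens with a global (Luong-style) attention decoder. The module names, feature dimensions, and vocabulary size are illustrative assumptions, not the authors' exact configuration.

# Sketch only: dimensions and wiring are assumptions for illustration.
import torch
import torch.nn as nn


class VideoToCommand(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=500, emb_dim=256):
        super().__init__()
        # Bidirectional LSTM over the sequence of per-clip features (RGB + flow, concatenated).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Single-head self-attention scores over the encoder states.
        self.self_attn = nn.Linear(2 * hidden, 1)
        # Decoder LSTM over command-token embeddings.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, batch_first=True)
        # Output projection over [decoder state; global-attention context].
        self.out = nn.Linear(4 * hidden, vocab_size)

    def forward(self, feats, commands):
        """feats: (B, T, feat_dim) video features; commands: (B, L) token ids (teacher forcing)."""
        enc, _ = self.encoder(feats)                              # (B, T, 2H)
        # Self-attention re-weights the encoder states before decoding.
        w = torch.softmax(self.self_attn(enc), dim=1)             # (B, T, 1)
        enc = enc * w
        # Initialise the decoder from the attention-pooled video summary.
        summary = enc.sum(dim=1, keepdim=True)                    # (B, 1, 2H)
        h0 = summary.transpose(0, 1).contiguous()                 # (1, B, 2H)
        c0 = torch.zeros_like(h0)
        dec, _ = self.decoder(self.embed(commands), (h0, c0))     # (B, L, 2H)
        # Global attention: align each decoder step with all encoder states.
        scores = torch.bmm(dec, enc.transpose(1, 2))              # (B, L, T)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc)   # (B, L, 2H)
        return self.out(torch.cat([dec, context], dim=-1))        # (B, L, vocab_size)


# Usage: next-token logits for a batch of 2 videos with 30 feature clips each.
model = VideoToCommand()
logits = model(torch.randn(2, 30, 2048), torch.randint(0, 500, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 500])

In this sketch the generated token sequence would be mapped to robot motions by a downstream controller (e.g. via ROS and MoveIt!, as in the paper's Baxter deployment); that mapping is outside the scope of the model itself.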