MO Xiuyun, CHEN Junhong, YANG Zhenguo, LIU Wenyin. A Robotic Command Generation Framework Based on Human Demonstration Videos[J]. ROBOT, 2022, 44(2): 186-194, 202. DOI: 10.13973/j.cnki.robot.200539


A Robotic Command Generation Framework Based on Human Demonstration Videos


Abstract: To improve robots' skill learning ability and eliminate the manual teaching process, a sequence-to-sequence framework is proposed to automatically generate robot commands from observation of human demonstration videos without any special markers. First, Mask R-CNN (region-based convolutional neural network) is used to narrow down the manipulation area, and a two-stream I3D network (inflated 3D convolutional network) is adopted to extract optical flow features and RGB features from the videos. Second, a bidirectional LSTM (long short-term memory) network is introduced to acquire contextual information from the previously extracted features. Finally, self-attention and global attention mechanisms are integrated to learn the correlation between the sequence of video frames and the sequence of commands, and the sequence-to-sequence model ultimately outputs the robot commands. Extensive experiments are conducted on the expanded MPII Cooking 2 dataset and the IIT-V2C dataset. Compared with existing methods, the proposed method achieves state-of-the-art performance on metrics such as BLEU_4 (0.705) and METEOR (0.462). The results show that the proposed method can learn manipulation tasks from human demonstration videos. In addition, the framework is successfully applied to a Baxter robot.
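For illustration only, below is a minimal PyTorch sketch of the sequence-to-sequence stage described in the abstract: a bidirectional LSTM encodes precomputed two-stream I3D clip features, and a global-attention LSTM decoder emits command tokens. This is not the authors' implementation; the module name, feature dimensions, and vocabulary size are assumptions, and the self-attention component over the encoder states is omitted for brevity.

```python
# Minimal sketch (assumptions, not the authors' code): BiLSTM encoder over
# precomputed two-stream I3D clip features + global-attention LSTM decoder
# that emits command tokens with teacher forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoToCommand(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=200):
        super().__init__()
        # Bidirectional LSTM captures context across the clip-feature sequence
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.attn = nn.Linear(hidden, 2 * hidden)   # global (Luong-style) attention
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, commands):
        # feats: (B, T, feat_dim) concatenated RGB + optical-flow I3D features
        # commands: (B, L) token ids of the target command sequence
        enc, _ = self.encoder(feats)                 # (B, T, 2*hidden)
        B, L = commands.shape
        h = feats.new_zeros(B, self.decoder.hidden_size)
        c = torch.zeros_like(h)
        ctx = enc.mean(dim=1)                        # initial context vector
        logits = []
        for t in range(L):
            emb = self.embed(commands[:, t])         # teacher forcing
            h, c = self.decoder(torch.cat([emb, ctx], dim=-1), (h, c))
            # attention weights over encoder states, then a new context vector
            scores = torch.bmm(enc, self.attn(h).unsqueeze(-1)).squeeze(-1)   # (B, T)
            weights = F.softmax(scores, dim=-1)
            ctx = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)             # (B, 2*hidden)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, L, vocab_size)
```

In practice the decoded token sequence would then be mapped to executable robot commands (e.g. grasp/move primitives) for a manipulator such as the Baxter robot mentioned above.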

     
