Lightweight Encoding-Decoding Grasp Pose Detection Based on a Context Aggregation Strategy
Abstract: Estimating grasp poses for diverse targets in unstructured environments is difficult. To address this problem, a lightweight encoding/decoding grasp pose detection network based on a context aggregation strategy is proposed. First, on top of an encoder-decoder architecture, a depthwise separation-fusion feature extraction block is constructed from depthwise separable convolution layers and shuffle units, which reduces the number of encoder parameters and strengthens the network's ability to extract grasp-region features. Second, a depthwise separation-reconstruction block is built from bilinear interpolation and depthwise separable convolution layers, which recovers the information lost in high-level features while effectively reducing the number of decoder parameters. Finally, to address the inconsistency between graspable-region pixels and the overall appearance of the target object, a grasp-region context aggregation strategy based on a cross-entropy auxiliary loss and a self-attention mechanism is proposed; it guides the network to strengthen the representation of graspable-region features and to suppress redundant features at non-graspable pixels. Experimental results show that the proposed network reaches grasp detection accuracies of 97.8% and 93.8% on the image-wise and object-wise splits of the Cornell dataset, respectively, with a single-image detection speed of 64.93 frames per second, and 95.1% accuracy on the Jacquard dataset at 60.6 frames per second. Compared with the baseline networks, the proposed network not only has a lower computational cost and fewer parameters but also clearly improves both the accuracy and the speed of grasp detection. In real-scene grasp trials on 9 kinds of objects, the grasp success rate reaches 93.3%.
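The encoder's separation-fusion extraction block can be illustrated with a minimal PyTorch sketch. Only the pairing of a depthwise separable convolution with a ShuffleNet-style shuffle unit comes from the abstract; the class name `SeparationFusionBlock`, the 3x3 kernel size, and the group count of 2 are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Shuffle-unit permutation (as in ShuffleNet): interleave channels across
    # groups so that subsequent grouped operations can exchange information.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class SeparationFusionBlock(nn.Module):
    """Hypothetical encoder block: depthwise separable convolution followed
    by a shuffle unit, trading a standard convolution for far fewer weights."""
    def __init__(self, in_ch: int, out_ch: int, groups: int = 2):
        super().__init__()
        # Depthwise stage: one 3x3 filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1,
                                   groups=in_ch, bias=False)
        # Pointwise stage: 1x1 convolution fuses channels and sets the width.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.groups = groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return channel_shuffle(x, self.groups)
```

The parameter saving is what makes the encoder lightweight: for C input and C output channels, a standard 3x3 convolution needs 9C^2 weights, while the depthwise-plus-pointwise pair needs only 9C + C^2.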
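The decoder's separation-reconstruction block admits a similar sketch, assuming 2x bilinear upsampling followed by a depthwise separable convolution; the class name and the fixed scale factor are again illustrative choices rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparationReconstructionBlock(nn.Module):
    """Hypothetical decoder block: parameter-free bilinear interpolation
    restores spatial resolution lost in the encoder, and a depthwise
    separable convolution refines the upsampled features cheaply."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Bilinear upsampling replaces a learned transposed convolution,
        # contributing no parameters to the decoder.
        x = F.interpolate(x, scale_factor=2, mode='bilinear',
                          align_corners=False)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```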
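The context aggregation strategy can be sketched as follows, under clearly labeled assumptions: the auxiliary branch is taken to be a 1x1 convolution predicting a two-class graspable/background map trained with cross-entropy, and its soft mask gates a simple global self-attention pooling. The abstract does not specify the exact attention form, so this is one plausible instantiation, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspContextAggregation(nn.Module):
    """Hypothetical context-aggregation head: a cross-entropy-supervised
    auxiliary branch estimates where graspable pixels are, and its soft mask
    weights a self-attention pooling so that the aggregated context is drawn
    from graspable regions while background features are suppressed."""
    def __init__(self, ch: int):
        super().__init__()
        self.aux_head = nn.Conv2d(ch, 2, 1)  # graspable vs. background logits
        self.query = nn.Conv2d(ch, ch, 1)    # projection for attention scores

    def forward(self, feat: torch.Tensor, aux_target: torch.Tensor = None):
        n, c, h, w = feat.shape
        logits = self.aux_head(feat)              # (n, 2, h, w)
        mask = logits.softmax(dim=1)[:, 1:]       # P(graspable), (n, 1, h, w)
        # Attention scores over all pixels, damped by the graspable mask so
        # non-graspable locations contribute little to the pooled context.
        scores = self.query(feat).mean(dim=1, keepdim=True) * mask
        weights = scores.flatten(2).softmax(dim=-1)               # (n, 1, h*w)
        context = torch.bmm(feat.flatten(2), weights.transpose(1, 2))  # (n, c, 1)
        out = feat + context.view(n, c, 1, 1)     # broadcast context back
        # Auxiliary cross-entropy loss against a graspable-region label map.
        aux_loss = (F.cross_entropy(logits, aux_target)
                    if aux_target is not None else None)
        return out, aux_loss
```

At training time the auxiliary loss would be added to the main grasp detection loss with a small weight; at inference the extra branch costs only a single 1x1 convolution, keeping the head lightweight.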