A Binocular Stereo Matching Algorithm Based on Three-branch Hybrid Feature Extractor

范诗萌; 孙炜; 覃宇; 覃业宝; 胡曼倩; 刘崇沛

doi:10.13973/j.cnki.robot.230235

Abstract

Most deep-learning-based binocular stereo matching algorithms use a convolutional neural network (CNN) for feature extraction. However, CNNs have inherent limitations, such as a limited receptive field and convolution kernel weight sharing, which make it difficult to extract highly discriminative features and lead to low matching accuracy in challenging regions such as weakly textured regions and fine-detail regions. To address this problem, a binocular stereo matching algorithm based on a three-branch hybrid feature extractor is proposed. Specifically, a CNN branch, a Swin Transformer branch, and a fusion branch are arranged in parallel to extract features from the left and right images. The parallel arrangement preserves both the local feature representation ability of the CNN and the global feature representation ability of the Swin Transformer. The fusion branch consists of multi-stage global-local information adapters, which not only fuse and express the global and local information within each stage but also propagate features across stages, thereby selecting strongly correlated feature information suited to weakly textured and fine-detail regions and improving stereo matching accuracy. Ablation experiments on the SceneFlow dataset verify the effectiveness of the proposed algorithm. The algorithm is tested on the SceneFlow, KITTI 2012, and KITTI 2015 datasets. The proposed method achieves an end-point error (EPE) of 0.652 pixels on the SceneFlow dataset, and on the KITTI 2012 dataset the percentage of pixels in non-occluded regions with a disparity error greater than 5 pixels is 0.79%. The results show that the proposed algorithm achieves high stereo matching accuracy.
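
The two accuracy figures quoted in the abstract can be illustrated with a minimal sketch. It assumes disparity maps flattened into plain sequences and an optional validity mask standing in for the non-occluded region. The function names `end_point_error` and `bad_pixel_percentage` are hypothetical, and the abstract does not describe the exact evaluation protocol, so this shows only the usual definitions of the metrics, not the authors' evaluation code.

```python
from typing import List, Optional, Sequence


def _valid_errors(pred: Sequence[float], gt: Sequence[float],
                  valid: Optional[Sequence[bool]]) -> List[float]:
    """Absolute disparity errors restricted to pixels marked valid."""
    if len(pred) != len(gt):
        raise ValueError("prediction and ground truth differ in length")
    if valid is None:
        valid = [True] * len(gt)  # assumption: every pixel is evaluated
    errors = [abs(p - g) for p, g, v in zip(pred, gt, valid) if v]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return errors


def end_point_error(pred: Sequence[float], gt: Sequence[float],
                    valid: Optional[Sequence[bool]] = None) -> float:
    """Mean absolute disparity error, in pixels, over valid pixels."""
    errors = _valid_errors(pred, gt, valid)
    return sum(errors) / len(errors)


def bad_pixel_percentage(pred: Sequence[float], gt: Sequence[float],
                         threshold: float = 5.0,
                         valid: Optional[Sequence[bool]] = None) -> float:
    """Percentage of valid pixels whose error exceeds `threshold` pixels."""
    errors = _valid_errors(pred, gt, valid)
    bad = sum(1 for e in errors if e > threshold)
    return 100.0 * bad / len(errors)


if __name__ == "__main__":
    # Toy disparities in pixels; the values are illustrative only.
    pred = [10.0, 20.5, 30.0, 47.0]
    gt = [10.0, 20.0, 31.0, 40.0]
    print(end_point_error(pred, gt))        # 2.125
    print(bad_pixel_percentage(pred, gt))   # 25.0
```

Passing a mask such as `valid=[True, True, True, False]` restricts both metrics to the marked pixels, which is how a non-occluded-only figure like the KITTI 2012 number would be obtained under these assumptions.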
