Abstract:
Binocular stereo matching algorithms based on deep learning usually use convolutional neural networks (CNNs) to extract features. However, CNNs have inherent limitations, such as a limited receptive field and the sharing of convolution kernel weights, which make it difficult to extract highly discriminative features and lead to lower matching accuracy in challenging regions, such as weakly textured regions and fine-detail regions. To address this problem, a binocular stereo matching algorithm based on a three-branch hybrid feature extractor is proposed. Specifically, a CNN branch, a Swin Transformer branch, and a fusion branch are arranged in parallel and perform feature extraction on the left and right images. The parallel arrangement preserves both the local feature representation ability of the CNN and the global feature representation ability of the Swin Transformer. The fusion branch is composed of multi-stage global-local information adapters. These adapters fuse global and local information within each stage and also propagate features across stages. Moreover, strongly correlated feature information suited to weakly textured regions and fine-detail regions is selected, which improves stereo matching accuracy. Ablation experiments on the SceneFlow dataset verify the effectiveness of the proposed algorithm, which is then tested on the SceneFlow, KITTI 2012, and KITTI 2015 datasets. The proposed method achieves an end-point error (EPE) of 0.652 pixels on the SceneFlow dataset, and on the KITTI 2012 dataset the percentage of pixels in non-occluded regions with a disparity error greater than 5 pixels is 0.79%. These results indicate that the proposed algorithm achieves high stereo matching accuracy.
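For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how they are commonly computed, not the evaluation code used in this work. It assumes that EPE is the mean absolute disparity error over valid pixels and that the second metric is the percentage of valid pixels whose absolute disparity error exceeds a threshold. The function names `end_point_error` and `bad_pixel_rate`, and the boolean `valid` mask standing in for the non-occluded region, are hypothetical and chosen only for illustration.

```python
import numpy as np


def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error in pixels over the valid pixels.

    `valid` is an assumed boolean mask (e.g. non-occluded pixels);
    if omitted, every pixel is counted.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    valid = np.asarray(valid, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())


def bad_pixel_rate(pred, gt, threshold=5.0, valid=None):
    """Percentage of valid pixels whose disparity error exceeds `threshold`."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    valid = np.asarray(valid, dtype=bool)
    err = np.abs(pred - gt)[valid]
    return float(100.0 * (err > threshold).mean())


# Toy 2x2 disparity maps (illustrative values, not data from the paper).
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[10.5, 20.0], [36.0, 40.0]])

print(end_point_error(pred, gt))      # 1.625
print(bad_pixel_rate(pred, gt, 5.0))  # 25.0
```

In the toy example the per-pixel errors are 0.5, 0, 6, and 0 pixels, so the mean error is 1.625 pixels and one of the four pixels exceeds the 5-pixel threshold.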