Abstract:
To enhance the robustness of multi-camera multi-target tracking in low-light environments, a multi-modal tracking algorithm that fuses infrared and visible-light video, named FMMT (Fusion-based Multi-camera Multi-target Tracking), is proposed. The algorithm employs a deep neural network to adaptively fuse multi-modal features from visible-light and infrared cameras, and uses a global association Transformer for cross-camera target association. For validation, the first multi-modal multi-camera multi-target tracking dataset, named M3 Track, is constructed; it contains 20 scenes, 100k image pairs, and 1.129 million targets. Experimental results show that the proposed algorithm achieves 61.7 CVMA (cross-view matching accuracy) and 70.3 CVIDF1 (cross-view IDF1) on the M3 Track dataset, significantly outperforming competing methods, especially in nighttime scenarios. This work provides an effective solution for multi-camera multi-target tracking under complex lighting conditions.
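The abstract does not specify how the adaptive multi-modal fusion is realized; below is a minimal PyTorch sketch of one plausible design, a learned per-pixel gate over visible-light and infrared backbone features. The module name, layer sizes, and gating scheme are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Illustrative gated fusion of visible-light and infrared feature maps.

    A small convolutional gate predicts a per-pixel weight from the
    concatenated modalities, so the network can lean on infrared features
    where the visible-light signal is weak (e.g., at night). All layer
    sizes here are hypothetical.
    """

    def __init__(self, channels: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # per-location fusion weight in [0, 1]
        )

    def forward(self, feat_rgb: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        # Convex combination of the two modalities, weighted per pixel.
        w = self.gate(torch.cat([feat_rgb, feat_ir], dim=1))
        return w * feat_rgb + (1 - w) * feat_ir


if __name__ == "__main__":
    fusion = AdaptiveFusion(channels=256)
    rgb = torch.randn(1, 256, 32, 32)  # visible-light backbone features
    ir = torch.randn(1, 256, 32, 32)   # infrared backbone features
    fused = fusion(rgb, ir)
    print(fused.shape)  # torch.Size([1, 256, 32, 32])
```

A gate of this kind degrades gracefully: when the visible channel is uninformative, the learned weight can approach zero and the fused feature reduces to the infrared branch, which is consistent with the nighttime gains the abstract reports.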