Multimodal Drone Cross-view Geo-localization Based on Vision-language Model
Graphical Abstract
Abstract
Cross-view geo-localization enables drones to position themselves autonomously under satellite-navigation-denied conditions by matching onboard images against geo-referenced images, with the primary challenge lying in the significant appearance differences between cross-view images. Existing methods predominantly focus on local feature extraction and lack an in-depth exploration of contextual correlations and global semantics. To address this problem, this paper proposes a multimodal drone cross-view geo-localization framework based on a vision-language model. Leveraging the CLIP (contrastive language-image pre-training) model, a view text description generation module is constructed that uses image-level visual concepts learned from large-scale datasets as external knowledge to guide feature extraction. A hybrid vision transformer (ViT) architecture is adopted as the backbone network, enabling the model to capture both local features and global contextual characteristics during image feature extraction. Furthermore, a mutual learning loss supervised by a logit score-normalized Kullback-Leibler (KL) divergence is introduced to optimize training and strengthen the model's ability to learn inter-view correlations. Experimental results demonstrate that, guided by text descriptions generated by the CLIP model, the proposed model learns deep semantic information more effectively and thus better handles challenges such as viewpoint variations and temporal discrepancies in cross-view geo-localization.
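
To make the mutual learning loss concrete, the listing below is a minimal, illustrative sketch (not the authors' implementation) of a symmetric objective in which each branch's classification logits are L2-normalized before the softmax and the two predictive distributions are aligned with KL divergence. The function names, the temperature tau, and the batch/class sizes are assumptions for illustration only.

# Minimal, illustrative sketch (not the authors' code): a symmetric mutual-learning
# loss in which each view's classification logits are L2-normalized before the
# softmax and the two predictive distributions are aligned with KL divergence.
import torch
import torch.nn.functional as F


def normalize_logits(logits: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Scale each sample's logit vector to unit L2 norm, then apply a temperature.

    `tau` is an assumed hyperparameter, not a value taken from the paper.
    """
    return logits / (logits.norm(dim=-1, keepdim=True) + 1e-12) / tau


def mutual_kl_loss(drone_logits: torch.Tensor, satellite_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the two branches' normalized predictions."""
    log_p_drone = F.log_softmax(normalize_logits(drone_logits), dim=-1)
    log_p_sat = F.log_softmax(normalize_logits(satellite_logits), dim=-1)
    # Each direction treats the other branch as a fixed target (one common design choice).
    kl_drone_to_sat = F.kl_div(log_p_drone, log_p_sat.detach(),
                               reduction="batchmean", log_target=True)
    kl_sat_to_drone = F.kl_div(log_p_sat, log_p_drone.detach(),
                               reduction="batchmean", log_target=True)
    return 0.5 * (kl_drone_to_sat + kl_sat_to_drone)


# Toy usage: a batch of 8 samples scored against 700 candidate geo-reference classes.
drone_logits = torch.randn(8, 700)
satellite_logits = torch.randn(8, 700)
loss = mutual_kl_loss(drone_logits, satellite_logits)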