Multimodal Drone Cross-view Geo-localization Based on Vision-language Model
Graphical Abstract
Abstract
Cross-view geo-localization enables drones to position themselves autonomously under satellite-navigation-denied conditions by matching onboard images against geo-referenced images, with the primary challenge lying in the significant appearance differences between cross-view images. Existing methods predominantly focus on local feature extraction and lack an in-depth exploration of contextual correlations and global semantics. To address this problem, this paper proposes a multimodal drone cross-view geo-localization framework based on a vision-language model. Leveraging the CLIP (contrastive language-image pre-training) model, a view text description generation module is constructed that uses image-level visual concepts learned from large-scale datasets as external knowledge to guide feature extraction. A hybrid vision transformer (ViT) architecture is adopted as the backbone network, enabling the model to capture both local features and global contextual characteristics during image feature extraction. Furthermore, a mutual learning loss supervised by a logit score-normalized Kullback-Leibler (KL) divergence is introduced to optimize training and strengthen the model's ability to learn inter-view correlations. Experimental results demonstrate that, guided by text descriptions generated by the CLIP model, the proposed model learns deep semantic information more effectively and thus better handles challenges such as viewpoint variations and temporal discrepancies in cross-view geo-localization.
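
To make the mutual learning loss concrete, the listing below is a minimal, illustrative sketch (not the authors' implementation) of a symmetric objective in which each branch's classification logits are L2-normalized before the softmax and the two predictive distributions are aligned with KL divergence. The function names, the temperature tau, and the batch/class sizes are assumptions for illustration only.

# Minimal, illustrative sketch (not the authors' code): a symmetric mutual-learning
# loss in which each view's classification logits are L2-normalized before the
# softmax and the two predictive distributions are aligned with KL divergence.
import torch
import torch.nn.functional as F


def normalize_logits(logits: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Scale each sample's logit vector to unit L2 norm, then apply a temperature.

    `tau` is an assumed hyperparameter, not a value taken from the paper.
    """
    return logits / (logits.norm(dim=-1, keepdim=True) + 1e-12) / tau


def mutual_kl_loss(drone_logits: torch.Tensor, satellite_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the two branches' normalized predictions."""
    log_p_drone = F.log_softmax(normalize_logits(drone_logits), dim=-1)
    log_p_sat = F.log_softmax(normalize_logits(satellite_logits), dim=-1)
    # Each direction treats the other branch as a fixed target (one common design choice).
    kl_drone_to_sat = F.kl_div(log_p_drone, log_p_sat.detach(),
                               reduction="batchmean", log_target=True)
    kl_sat_to_drone = F.kl_div(log_p_sat, log_p_drone.detach(),
                               reduction="batchmean", log_target=True)
    return 0.5 * (kl_drone_to_sat + kl_sat_to_drone)


# Toy usage: a batch of 8 samples scored against 700 candidate geo-reference classes.
drone_logits = torch.randn(8, 700)
satellite_logits = torch.randn(8, 700)
loss = mutual_kl_loss(drone_logits, satellite_logits)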