Multimodal Drone Cross-view Geo-localization Based on Vision-language Model

陈鹏; 陈旭; 罗文; 林斌

doi:10.13973/j.cnki.robot.240283

Abstract: Drone cross-view geo-localization achieves autonomous positioning when satellite navigation is denied by matching onboard images against geo-referenced images; the main challenge is the significant appearance difference between images taken from different views. Existing methods are mostly limited to local feature extraction and do not mine contextual correlations and global semantics in depth. To address this, a multimodal drone cross-view geo-localization model based on a vision-language model is proposed. A view text description generation module is built on the CLIP (contrastive language-image pre-training) model, and the image-level visual concepts that CLIP has learned from massive datasets serve as external knowledge to guide the model's feature extraction. A hybrid ViT (vision Transformer) architecture is adopted as the backbone network so that the model captures both local features and global contextual features when extracting image features. In addition, to help the model learn the correlations between views more effectively, a mutual learning loss based on logit-standardized KL (Kullback-Leibler) divergence is introduced to supervise training. Experimental results show that, guided by the text descriptions generated by CLIP, the proposed model learns deep semantic information more readily and therefore copes better with challenges in cross-view geo-localization such as differences in viewpoint and capture time.
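The abstract names a mutual learning loss built on KL divergence over standardized logits but does not give its formula. The following is a minimal sketch of one common form of such a loss, not the authors' implementation: it assumes per-sample z-score standardization of the logits and a symmetric KL term between the drone-view and satellite-view predictions. The function names, the `temperature` parameter, and the symmetric weighting are all illustrative assumptions.

```python
import math


def standardize(logits, eps=1e-7):
    """Z-score a logit vector: subtract its mean, divide by its std (assumed form)."""
    mean = sum(logits) / len(logits)
    var = sum((x - mean) ** 2 for x in logits) / len(logits)
    std = math.sqrt(var) + eps  # eps guards against constant logits
    return [(x - mean) / std for x in logits]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from two log-probability vectors."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mutual_learning_loss(drone_logits, satellite_logits, temperature=1.0):
    """Symmetric KL between the standardized predictions of the two views.

    Hypothetical helper: the paper's exact weighting and temperature are not
    stated in the abstract.
    """
    log_p = log_softmax(standardize(drone_logits), temperature)
    log_q = log_softmax(standardize(satellite_logits), temperature)
    return kl_divergence(log_p, log_q) + kl_divergence(log_q, log_p)
```

Under these assumptions, standardization makes the loss insensitive to a shift or positive rescaling of either branch's logits, so the two views are compared on the relative ordering and spacing of class scores rather than on their raw magnitudes.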
