HyperAIHyperAI

Command Palette

Search for a command to run...

3 months ago

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Fan Liu Delong Chen Zhangqingyun Guan Xiaocong Zhou Jiale Zhu Qiaolin Ye Liyong Fu Jun Zhou

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Abstract

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12 $\times$ larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP

Code Repositories

chendelong1999/remoteclip
Official
pytorch
Mentioned in GitHub

Benchmarks

BenchmarkMethodologyMetrics
cross-modal-retrieval-on-rsicdRemoteCLIP
Image-to-text R@1: 18.39%
Mean Recall: 36.35%
text-to-image R@1: 14.73%
cross-modal-retrieval-on-rsitmdRemoteCLIP
Image-to-text R@1: 28.76%
Mean Recall: 50.52%
text-to-imageR@1: 23.76%

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp