Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization
Lukas Haas; Silas Alberti; Michal Skreta

Abstract
Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present StreetCLIP (https://huggingface.co/geolocal/StreetCLIP), a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
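To make the zero-shot setting concrete, the following is a minimal sketch of CLIP-style zero-shot scoring: an image embedding is compared against the embeddings of candidate captions, and the similarities are turned into probabilities with a softmax. Everything specific here is an assumption for illustration: the caption wording, the toy three-dimensional embeddings (standing in for the outputs of real image and text encoders), and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, caption_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities.

    caption_embs maps each candidate caption to its (hypothetical) text
    embedding. The temperature is an illustrative value, not a paper setting.
    """
    captions = list(caption_embs)
    logits = [cosine(image_emb, caption_embs[c]) / temperature for c in captions]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {c: e / total for c, e in zip(captions, exps)}


# Toy embeddings; a real system would obtain these from the model's encoders.
caption_embs = {
    "A street-level photo taken in France.": [0.9, 0.1, 0.0],
    "A street-level photo taken in Japan.": [0.1, 0.9, 0.1],
    "A street-level photo taken in Brazil.": [0.0, 0.2, 0.9],
}
image_emb = [0.2, 0.8, 0.1]

probs = zero_shot_probs(image_emb, caption_embs)
best = max(probs, key=probs.get)
print(best)
```

Because the candidate set is just a dictionary of captions, new locations can be added at inference time without retraining, which is the property the zero-shot setting relies on.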
Benchmarks
| Benchmark | Methodology | Street level (1 km) | City level (25 km) | Region level (200 km) | Country level (750 km) | Continent level (2500 km) | Training images | Reference images |
|---|---|---|---|---|---|---|---|---|
| photo-geolocation-estimation-on-im2gps | StreetCLIP (Zero-Shot) | | 28.3 | 45.1 | 74.7 | 88.2 | 1.1M | 0 |
| photo-geolocation-estimation-on-im2gps3k | StreetCLIP (Zero-Shot) | - | 22.4 | 37.4 | 61.3 | 80.4 | 1.1M | |
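The distance thresholds in the table can be read as accuracy-within-radius metrics. The sketch below assumes each metric is the percentage of images whose predicted location falls within the given great-circle distance of the true location; the page does not state this definition explicitly, so treat it as an assumption. The function names and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Threshold names and distances as listed in the benchmark table.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_thresholds(preds, truths):
    """Percentage of predictions within each distance threshold of the truth."""
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {
        name: 100.0 * sum(d <= km for d in dists) / len(dists)
        for name, km in THRESHOLDS_KM.items()
    }


# Two illustrative samples: one exact hit and one London guess for a Paris photo.
paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
scores = accuracy_at_thresholds(preds=[paris, london], truths=[paris, paris])
print(scores)
```

In this toy case the London guess misses the street, city, and region radii but counts as correct at the country and continent radii, which illustrates why the reported numbers increase with the threshold.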