Visual Spatial Reasoning
Fangyu Liu Guy Emerson Nigel Collier

Abstract
Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as under, in front of, and facing). Despite a seemingly simple annotation format, we show that the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performance correlates poorly with the number of training examples, and that the tested models are generally incapable of recognising relations concerning the orientations of objects.
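For concreteness, the sketch below illustrates the kind of annotation format the abstract describes: each example pairs an image with a caption asserting one spatial relation, plus a binary label saying whether the caption is true of the image. The field names and values here are illustrative assumptions, not the release's exact schema.

```python
# Illustrative VSR-style example: an image, a caption asserting one spatial
# relation, and a binary truth label. Field names are assumptions for
# illustration; consult the official release for the exact schema.
example = {
    "image": "000000123456.jpg",               # hypothetical COCO image file name
    "caption": "The cat is under the table.",  # asserts one spatial relation
    "relation": "under",                       # one of the 66 relation types
    "label": 1,                                # 1 = caption holds, 0 = it does not
}

def accuracy(predictions, labels):
    """Fraction of examples where the predicted binary label matches the gold label."""
    return sum(int(p == g) for p, g in zip(predictions, labels)) / len(labels)

# With binary labels, accuracy is the metric reported in the benchmark table below.
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
```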
Benchmarks
| Benchmark | Model | Accuracy (%) |
|---|---|---|
| visual-reasoning-on-vsr | LXMERT | 70.1 |
| visual-reasoning-on-vsr | CLIP (frozen) | 56.0 |
| visual-reasoning-on-vsr | CLIP (finetuned) | 65.1 |
| visual-reasoning-on-vsr | ViLT | 69.3 |
| visual-reasoning-on-vsr | VisualBERT | 55.2 |
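The table above reports accuracy on VSR's binary classification task. As a rough sense of how a frozen dual encoder such as CLIP can be probed on such a pair, the sketch below scores one image-caption pair with Hugging Face transformers and thresholds the similarity. The checkpoint, image path, and threshold are assumptions for illustration; this is not the evaluation protocol behind the numbers above.

```python
# Minimal sketch: probing a frozen CLIP model on one VSR-style pair.
# The checkpoint, image path, and threshold are illustrative assumptions;
# this is not the protocol behind the "CLIP (frozen)" row above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")          # hypothetical image file
caption = "The cat is under the table."    # VSR-style spatial caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    score = model(**inputs).logits_per_image.item()  # image-text similarity

# Illustrative decision rule: call the caption "true" if the similarity exceeds
# a threshold tuned on a validation split (VSR labels are binary true/false).
THRESHOLD = 25.0
print(f"similarity={score:.2f}, predicted label={score > THRESHOLD}")
```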