3 months ago

Measuring Progress in Fine-grained Vision-and-Language Understanding

Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, Aida Nematzadeh

Abstract

While pretraining on large-scale image-text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained models lack "fine-grained" understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.

Code Repositories

e-bug/weak-relation-vlm (PyTorch, mentioned in GitHub)
e-bug/fine-grained-evals (official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark                       Methodology            Group Score  Image Score  Text Score
visual-reasoning-on-winoground  BLIP 129M (CapFilt/L)  12.2         15.2         34.7
visual-reasoning-on-winoground  X-VLM 4M               21.5         26.7         44.0
visual-reasoning-on-winoground  PEVL 14M               12.2         15.7         33.2
visual-reasoning-on-winoground  X-VLM 16M              21.2         24.5         46.7
visual-reasoning-on-winoground  BLIP 129M              11.7         15.0         35.5
visual-reasoning-on-winoground  ALBEF 14M              12.7         16.2         32.5
visual-reasoning-on-winoground  BLIP 14M               14.5         18.5         36.5
visual-reasoning-on-winoground  BLIP-ViT/L 129M        12.2         14.5         34.7
visual-reasoning-on-winoground  ALBEF 4M               11.0         15.5         29.2
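
The three Winoground metrics reported above can be illustrated with a short sketch. It assumes the standard Winoground setup, in which each example pairs two captions (C0, C1) with two images (I0, I1) and a model assigns a matching score to every caption-image combination. The names `PairScores` and `winoground_scores` are hypothetical and are not taken from the paper's code; the comparison rules follow the benchmark's published metric definitions as commonly stated, so treat this as an approximation rather than the official evaluation script.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PairScores:
    """Model matching scores for one Winoground example (hypothetical container)."""
    c0_i0: float  # score(caption 0, image 0), a correct match
    c1_i0: float  # score(caption 1, image 0), an incorrect match
    c0_i1: float  # score(caption 0, image 1), an incorrect match
    c1_i1: float  # score(caption 1, image 1), a correct match


def text_correct(s: PairScores) -> bool:
    # For each image, the matching caption must score strictly higher.
    return s.c0_i0 > s.c1_i0 and s.c1_i1 > s.c0_i1


def image_correct(s: PairScores) -> bool:
    # For each caption, the matching image must score strictly higher.
    return s.c0_i0 > s.c0_i1 and s.c1_i1 > s.c1_i0


def group_correct(s: PairScores) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(s) and image_correct(s)


def winoground_scores(examples: Sequence[PairScores]) -> Dict[str, float]:
    """Return text, image, and group scores as percentages of examples."""
    if not examples:
        raise ValueError("at least one example is required")
    n = len(examples)
    return {
        "text": 100.0 * sum(text_correct(s) for s in examples) / n,
        "image": 100.0 * sum(image_correct(s) for s in examples) / n,
        "group": 100.0 * sum(group_correct(s) for s in examples) / n,
    }
```

Because the group check requires both the text and image checks to pass, the group score can never exceed either of the other two, which matches the ordering seen in every row of the table.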
