VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Letitia Parcalabescu; Michele Cafagna; Lilitta Muradjan; Anette Frank; Iacer Calixto; Albert Gatt

Abstract

We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations.

Code Repositories

heidelberg-nlp/valse (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| image-sentence-alignment-on-valse | ViLBERT 12-in-1 | Average Accuracy: 63.2; average pairwise accuracy: 75.1 |
| image-sentence-alignment-on-valse | LXMERT | Average Accuracy: 53.5; average pairwise accuracy: 59.6 |
| image-sentence-alignment-on-valse | CLIP | average pairwise accuracy: 64.0 |
| image-sentence-alignment-on-valse | ViLBERT | Average Accuracy: 51.3; average pairwise accuracy: 63.7 |
| image-sentence-alignment-on-valse | VisualBERT | Average Accuracy: 48.8; average pairwise accuracy: 46.4 |
| image-sentence-alignment-on-valse | GPT1 | average pairwise accuracy: 60.7 |
| image-sentence-alignment-on-valse | GPT2 | average pairwise accuracy: 60.1 |
| image-sentence-alignment-on-valse-actant-swap | LXMERT | Accuracy (%): 48.5; pairwise accuracy: 45.8 |
| image-sentence-alignment-on-valse-actant-swap | CLIP | pairwise accuracy: 68.6 |
| image-sentence-alignment-on-valse-actant-swap | ViLBERT 12-in-1 | Accuracy (%): 52.2; pairwise accuracy: 58.9 |
| image-sentence-alignment-on-valse-actant-swap | VisualBERT | Accuracy (%): 49.7; pairwise accuracy: 44.4 |
| image-sentence-alignment-on-valse-actant-swap | GPT2 | pairwise accuracy: 76.9 |
| image-sentence-alignment-on-valse-actant-swap | ViLBERT | Accuracy (%): 50.4; pairwise accuracy: 68.3 |
| image-sentence-alignment-on-valse-actant-swap | GPT1 | pairwise accuracy: 72.2 |
| image-sentence-alignment-on-valse-action | GPT2 | pairwise accuracy: 66.8 |
| image-sentence-alignment-on-valse-action | VisualBERT | Accuracy (%): 48.8; pairwise accuracy: 49.2 |
| image-sentence-alignment-on-valse-action | GPT1 | pairwise accuracy: 65.4 |
| image-sentence-alignment-on-valse-action | LXMERT | Accuracy (%): 51.1; pairwise accuracy: 54.8 |
| image-sentence-alignment-on-valse-action | ViLBERT | Accuracy (%): 52.6; pairwise accuracy: 70.7 |
| image-sentence-alignment-on-valse-action | CLIP | pairwise accuracy: 75.6 |
| image-sentence-alignment-on-valse-action | ViLBERT 12-in-1 | Accuracy (%): 57.3; pairwise accuracy: 65.9 |
| image-sentence-alignment-on-valse-coreference | ViLBERT 12-in-1 | Accuracy (%): 54.4; pairwise accuracy: 75.7 |
| image-sentence-alignment-on-valse-coreference | CLIP | pairwise accuracy: 52.1 |
| image-sentence-alignment-on-valse-coreference | LXMERT | Accuracy (%): 49.8; pairwise accuracy: 46.8 |
| image-sentence-alignment-on-valse-coreference | ViLBERT | Accuracy (%): 50.0; pairwise accuracy: 47.2 |
| image-sentence-alignment-on-valse-coreference | VisualBERT | Accuracy (%): 50.0; pairwise accuracy: 49.5 |
| image-sentence-alignment-on-valse-coreference | GPT1 | pairwise accuracy: 45.6 |
| image-sentence-alignment-on-valse-coreference | GPT2 | pairwise accuracy: 54.5 |
| image-sentence-alignment-on-valse-coreference-1 | VisualBERT | Accuracy (%): 50.0; pairwise accuracy: 47.6 |
| image-sentence-alignment-on-valse-coreference-1 | ViLBERT 12-in-1 | Accuracy (%): 54.3; pairwise accuracy: 69.2 |
| image-sentence-alignment-on-valse-coreference-1 | GPT1 | pairwise accuracy: 45.2 |
| image-sentence-alignment-on-valse-coreference-1 | CLIP | pairwise accuracy: 49.7 |
| image-sentence-alignment-on-valse-coreference-1 | GPT2 | pairwise accuracy: 50.0 |
| image-sentence-alignment-on-valse-coreference-1 | LXMERT | Accuracy (%): 49.0; pairwise accuracy: 44.2 |
| image-sentence-alignment-on-valse-coreference-1 | ViLBERT | Accuracy (%): 50.0; pairwise accuracy: 48.1 |
| image-sentence-alignment-on-valse-counting | LXMERT | Accuracy (%): 52.0; pairwise accuracy: 62.2 |
| image-sentence-alignment-on-valse-counting | ViLBERT 12-in-1 | Accuracy (%): 64.9; pairwise accuracy: 76.7 |
| image-sentence-alignment-on-valse-counting | GPT2 | pairwise accuracy: 51.6 |
| image-sentence-alignment-on-valse-counting | VisualBERT | Accuracy (%): 48.3; pairwise accuracy: 48.2 |
| image-sentence-alignment-on-valse-counting | CLIP | pairwise accuracy: 62.1 |
| image-sentence-alignment-on-valse-counting | GPT1 | pairwise accuracy: 51.2 |
| image-sentence-alignment-on-valse-counting | ViLBERT | Accuracy (%): 50.7; pairwise accuracy: 58.6 |
| image-sentence-alignment-on-valse-counting-1 | VisualBERT | Accuracy (%): 47.8; pairwise accuracy: 48.2 |
| image-sentence-alignment-on-valse-counting-1 | ViLBERT | Accuracy (%): 50.6; pairwise accuracy: 62.9 |
| image-sentence-alignment-on-valse-counting-1 | CLIP | pairwise accuracy: 62.5 |
| image-sentence-alignment-on-valse-counting-1 | ViLBERT 12-in-1 | Accuracy (%): 69.2; pairwise accuracy: 80.2 |
| image-sentence-alignment-on-valse-counting-1 | GPT1 | pairwise accuracy: 48.7 |
| image-sentence-alignment-on-valse-counting-1 | LXMERT | Accuracy (%): 55.4; pairwise accuracy: 69.2 |
| image-sentence-alignment-on-valse-counting-1 | GPT2 | pairwise accuracy: 49.8 |
| image-sentence-alignment-on-valse-counting-2 | ViLBERT | Accuracy (%): 51.8; pairwise accuracy: 73.7 |
| image-sentence-alignment-on-valse-counting-2 | GPT1 | pairwise accuracy: 69.5 |
| image-sentence-alignment-on-valse-counting-2 | CLIP | pairwise accuracy: 57.5 |
| image-sentence-alignment-on-valse-counting-2 | GPT2 | pairwise accuracy: 45.3 |
| image-sentence-alignment-on-valse-counting-2 | LXMERT | Accuracy (%): 49.9; pairwise accuracy: 42.6 |
| image-sentence-alignment-on-valse-counting-2 | VisualBERT | Accuracy (%): 50.0; pairwise accuracy: 50.0 |
| image-sentence-alignment-on-valse-counting-2 | ViLBERT 12-in-1 | Accuracy (%): 66.7; pairwise accuracy: 77.3 |
| image-sentence-alignment-on-valse-existence | VisualBERT | Accuracy (%): 49.3; pairwise accuracy: 39.7 |
| image-sentence-alignment-on-valse-existence | LXMERT | Accuracy (%): 55.8; pairwise accuracy: 78.6 |
| image-sentence-alignment-on-valse-existence | CLIP | pairwise accuracy: 66.9 |
| image-sentence-alignment-on-valse-existence | ViLBERT 12-in-1 | Accuracy (%): 89.0; pairwise accuracy: 95.6 |
| image-sentence-alignment-on-valse-existence | ViLBERT | Accuracy (%): 2.4; pairwise accuracy: 66.5 |
| image-sentence-alignment-on-valse-existence | GPT1 | pairwise accuracy: 61.8 |
| image-sentence-alignment-on-valse-existence | GPT2 | pairwise accuracy: 58.0 |
| image-sentence-alignment-on-valse-foil-it | GPT2 | pairwise accuracy: 80.7 |
| image-sentence-alignment-on-valse-foil-it | ViLBERT 12-in-1 | Accuracy (%): 71.5; pairwise accuracy: 86.9 |
| image-sentence-alignment-on-valse-foil-it | GPT1 | pairwise accuracy: 77.5 |
| image-sentence-alignment-on-valse-foil-it | LXMERT | Accuracy (%): 70.8; pairwise accuracy: 87.1 |
| image-sentence-alignment-on-valse-foil-it | VisualBERT | Accuracy (%): 46.6; pairwise accuracy: 48.5 |
| image-sentence-alignment-on-valse-foil-it | ViLBERT | Accuracy (%): 55.9; pairwise accuracy: 86.9 |
| image-sentence-alignment-on-valse-foil-it | CLIP | pairwise accuracy: 88.8 |
| image-sentence-alignment-on-valse-plurality | ViLBERT 12-in-1 | Accuracy (%): 62.0; pairwise accuracy: 72.4 |
| image-sentence-alignment-on-valse-plurality | LXMERT | Accuracy (%): 55.1; pairwise accuracy: 64.4 |
| image-sentence-alignment-on-valse-plurality | CLIP | pairwise accuracy: 56.2 |
| image-sentence-alignment-on-valse-plurality | GPT1 | pairwise accuracy: 53.1 |
| image-sentence-alignment-on-valse-plurality | ViLBERT | Accuracy (%): 50.3; pairwise accuracy: 61.2 |
| image-sentence-alignment-on-valse-plurality | VisualBERT | Accuracy (%): 46.5; pairwise accuracy: 45.7 |
| image-sentence-alignment-on-valse-plurality | GPT2 | pairwise accuracy: 51.9 |
| image-sentence-alignment-on-valse-spatial | VisualBERT | Accuracy (%): 49.3; pairwise accuracy: 39.7 |
| image-sentence-alignment-on-valse-spatial | GPT2 | pairwise accuracy: 75.0 |
| image-sentence-alignment-on-valse-spatial | CLIP | pairwise accuracy: 64.3 |
| image-sentence-alignment-on-valse-spatial | LXMERT | Accuracy (%): 50.8; pairwise accuracy: 60.2 |
| image-sentence-alignment-on-valse-spatial | ViLBERT | Accuracy (%): 49.9; pairwise accuracy: 57.2 |
| image-sentence-alignment-on-valse-spatial | ViLBERT 12-in-1 | Accuracy (%): 53.4; pairwise accuracy: 67.7 |
| image-sentence-alignment-on-valse-spatial | GPT1 | pairwise accuracy: 77.2 |
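
The results above report two metrics, accuracy and pairwise accuracy, and some models have only the latter. As a rough illustration of how the two can differ, the sketch below assumes that pairwise accuracy counts the image-sentence pairs where the correct caption scores higher than its foil, and that accuracy classifies captions and foils independently against a fixed threshold. The function names, the 0.5 threshold, and the toy scores are hypothetical and are not taken from the VALSE code base.

```python
from typing import Sequence


def pairwise_accuracy(caption_scores: Sequence[float],
                      foil_scores: Sequence[float]) -> float:
    """Fraction of caption/foil pairs where the caption scores strictly higher."""
    if len(caption_scores) != len(foil_scores) or not caption_scores:
        raise ValueError("need two non-empty score lists of equal length")
    wins = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return wins / len(caption_scores)


def classification_accuracy(caption_scores: Sequence[float],
                            foil_scores: Sequence[float],
                            threshold: float = 0.5) -> float:
    """Fraction of sentences classified correctly on their own.

    Assumption: a caption is correct when its score reaches the threshold,
    and a foil is correct when its score stays below it.
    """
    total = len(caption_scores) + len(foil_scores)
    if total == 0:
        raise ValueError("need at least one score")
    correct = sum(c >= threshold for c in caption_scores)
    correct += sum(f < threshold for f in foil_scores)
    return correct / total


# Toy image-sentence alignment scores (made up for illustration only).
caption_scores = [0.9, 0.6, 0.4, 0.8]
foil_scores = [0.7, 0.7, 0.3, 0.2]

print(pairwise_accuracy(caption_scores, foil_scores))        # 0.75
print(classification_accuracy(caption_scores, foil_scores))  # 0.625
```

Under these assumptions a model can rank the caption above its foil while still misclassifying one of the two sentences against the threshold, which is one way pairwise accuracy can exceed plain accuracy for the same model.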
