Visual Question Answering on GQA test-dev
Metrics: Accuracy (%)
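GQA accuracy is a plain exact-match rate: a prediction counts as correct only if its answer string matches the gold answer. A minimal sketch of the computation, assuming predictions and gold answers keyed by question ID (the `gqa_accuracy` name is hypothetical, and the official GQA evaluator applies additional answer normalization omitted here):

```python
def gqa_accuracy(predictions: dict[str, str], answers: dict[str, str]) -> float:
    """Exact-match accuracy in percent.

    `predictions` and `answers` map question IDs to answer strings.
    Illustrative sketch only: the official GQA evaluation script also
    normalizes answers (articles, punctuation) before comparison.
    """
    correct = sum(
        predictions.get(qid, "").strip().lower() == gold.strip().lower()
        for qid, gold in answers.items()
    )
    return 100.0 * correct / len(answers)
```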
Results
Accuracy of each model on the GQA test-dev benchmark.
| Model Name | Accuracy (%) | Paper Title | Repository |
|---|---|---|---|
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 34.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 44.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| PNP-VQA | 41.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | - |
| PaLI-X-VPD | 67.3 | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | - |
| LXMERT (Pre-train + scratch) | 60.0 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | - |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 44.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| FewVLM (zero-shot) | 29.3 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | - |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| HYDRA | 47.9 | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | - |
| NSM | 62.95 | Learning by Abstraction: The Neural State Machine | - |
| Lyrics | 62.4 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | - |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | 33.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| single-hop + LCGN (ours) | 55.8 | Language-Conditioned Graph Networks for Relational Reasoning | - |
| CFR | 72.1 | Coarse-to-Fine Reasoning for Visual Question Answering | - |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 44.2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | - |
| Video-LaVIT | 64.4 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | - |
| CuMo-7B | 64.9 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | - |
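Several of the zero-shot entries above are BLIP-2 variants. As an illustration of how such a baseline can be queried, here is a minimal sketch using the Hugging Face `transformers` checkpoint `Salesforce/blip2-opt-2.7b`; the prompt format follows the "Question: ... Answer:" style described in the BLIP-2 paper, while the image path and decoding settings are assumptions (the official BLIP-2 release uses the LAVIS library):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# BLIP-2 ViT-G OPT 2.7B as published on the Hugging Face Hub.
MODEL_ID = "Salesforce/blip2-opt-2.7b"

processor = Blip2Processor.from_pretrained(MODEL_ID)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # assumed path to a GQA image
prompt = "Question: what color is the car? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    "cuda", torch.float16
)
out = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```

Scoring such generations with an exact-match function like the one above approximates the zero-shot protocol, though reported numbers depend on the exact prompt and decoding settings used in each paper.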