HyperAI

Visual Question Answering on GQA test-dev

Evaluation Metric

Accuracy

Evaluation Results

Performance of each model on this benchmark

| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 34.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 44.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| PNP-VQA | 41.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| PaLI-X-VPD | 67.3 | Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models | - |
| LXMERT (Pre-train + scratch) | 60.0 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 44.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| FewVLM (zero-shot) | 29.3 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | 36.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| HYDRA | 47.9 | HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning | |
| NSM | 62.95 | Learning by Abstraction: The Neural State Machine | |
| Lyrics | 62.4 | Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | - |
| BLIP-2 ViT-L OPT 2.7B (zero-shot) | 33.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| single-hop + LCGN (ours) | 55.8 | Language-Conditioned Graph Networks for Relational Reasoning | |
| CFR | 72.1 | Coarse-to-Fine Reasoning for Visual Question Answering | |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | 44.2 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Video-LaVIT | 64.4 | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | |
| CuMo-7B | 64.9 | CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | |