HyperAI超神经
Visual Question Answering on OK-VQA
Evaluation Metric
Accuracy
Results
Performance of each model on this benchmark
| Model | Accuracy | Paper Title | Repository |
| --- | --- | --- | --- |
| PaLM-E-562B | 66.1 | PaLM-E: An Embodied Multimodal Language Model | |
| PICa | 48.0 | An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | |
| MetaLM | 11.4 | Language Models are General-Purpose Interfaces | |
| REVIVE (Ensemble) | 58.0 | REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering | |
| A Simple Baseline for KB-VQA | 61.2 | A Simple Baseline for Knowledge-Based Visual Question Answering | - |
| Prophet | 62.5 | Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering | |
| PNP-VQA | 35.9 | Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training | |
| RA-VQA-FrDPR (T5-large) | 51.22 | Retrieval Augmented Visual Question Answering with Outside Knowledge | |
| VLC-BERT | 43.1 | VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge | |
| BLIP-2 ViT-L FlanT5 XL (zero-shot) | 39.4 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Frozen | 5.9 | Multimodal Few-Shot Learning with Frozen Language Models | - |
| T5(Tan and Bansal, 2019) + Prefixes | 42.03 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VK-OOD | 52.4 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis | |
| FewVLM | 16.5 | A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models | |
| BLIP-2 ViT-G FlanT5 XXL (zero-shot) | 45.9 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| LaKo | 47.01 | LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection | |
| VLKD(ViT-B/16) | 10.5 | Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation | - |
| RA-VQA-v2 (BLIP 2) | 62.08 | Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering | - |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | 31.7 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| Flamingo3B | 41.2 | Flamingo: a Visual Language Model for Few-Shot Learning | |