Visual Question Answering (VQA) on InfoSeek
Evaluation Metric
Accuracy
Results
Performance of each model on this benchmark
Model | Accuracy | Paper Title | Repository |
---|---|---|---|
PaLI-X | 24 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | |
BLIP2 | 14.6 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
InstructBLIP | 14.5 | - | - |
CLIP + FiD | 20.9 | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | - |
CLIP + PaLM (540B) | 20.4 | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | - |
PaLI | 19.7 | Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? | - |
RA-VQAv2 w/ PreFLMR | 30.65 | PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers | - |