
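Abstract
We present the training recipe for PaLI-X, a multilingual vision and language model, and the results of scaling it up in both component size and the breadth of its training task mixture. The model achieves new levels of performance on a wide range of complex tasks, including image-based captioning and question answering, image-based document understanding, few-shot (in-context) learning, object detection, video question answering, and video captioning. PaLI-X achieves state-of-the-art results on most of the vision-and-language benchmarks considered (more than 25). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, on tasks that are not explicitly part of the training mix.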
Code Repositories
doc-doc/NExT-OE
pytorch
Mentioned in GitHub
kyegomez/PALI
pytorch
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| chart-question-answering-on-chartqa | PaLI-X (Single-task FT) | 1:1 Accuracy: 70.9  | 
| chart-question-answering-on-chartqa | PaLI-X (Multi-task FT) | 1:1 Accuracy: 70.6  | 
| chart-question-answering-on-chartqa | PaLI-X (Single-task FT w/ OCR) | 1:1 Accuracy: 72.3  | 
| fine-grained-image-recognition-on-oven | PaLI-X | Accuracy: 23.1  | 
| temporal-casual-qa-on-next-qa | PaLI-X | WUPS: 38.3  | 
| visual-question-answering-on-docvqa-test | PaLI-X (Single-task FT w/ OCR) | ANLS: 0.868  | 
| visual-question-answering-on-docvqa-test | PaLI-X (Single-task FT) | ANLS: 0.80  | 
| visual-question-answering-on-docvqa-test | PaLI-X (Multi-task FT) | ANLS: 0.809  | 
| visual-question-answering-on-ok-vqa | PaLI-X (Single-task FT) | Accuracy: 66.1  | 
| visual-question-answering-vqa-on | PaLI-X (Single-task FT) | ANLS: 49.2  | 
| visual-question-answering-vqa-on | PaLI-X (Multi-task FT) | ANLS: 50.7  | 
| visual-question-answering-vqa-on | PaLI-X (Single-task FT w/ OCR) | ANLS: 54.8  | 
| visual-question-answering-vqa-on-infoseek | PaLI-X | Accuracy: 24  |