HyperAI

Visual Question Answering (VQA) On

Metrics

ANLS
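ANLS is commonly expanded as Average Normalized Levenshtein Similarity. The page does not define it, so the following is only a minimal sketch of the usual formulation: each prediction is scored against its closest reference answer, the score is zeroed when the normalized edit distance reaches a threshold, and the per-question scores are averaged. The function names, the lowercase and whitespace normalization, and the 0.5 threshold are assumptions for illustration, not details taken from this benchmark's evaluation code. The table values appear to be on a 0 to 100 scale, so the result here would be multiplied by 100 to compare.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         threshold: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity (range 0..1).

    Assumption: strings are lowercased and whitespace-collapsed before
    comparison; the official scorer may normalize differently.
    """
    scores = []
    for pred, answers in zip(predictions, references):
        pred_norm = " ".join(pred.lower().split())
        best = 0.0
        for ans in answers:
            ans_norm = " ".join(ans.lower().split())
            longest = max(len(pred_norm), len(ans_norm))
            nl = levenshtein(pred_norm, ans_norm) / longest if longest else 0.0
            similarity = 1.0 - nl if nl < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `anls(["abcd"], [["abce"]])` scores one substitution over four characters, while a prediction sharing no characters with any reference scores zero for that question.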

Results

Performance results of various models on this benchmark

| Model Name | ANLS | Paper Title | Repository |
| --- | --- | --- | --- |
| GPT-3.5 + LATIN-Prompt | 48.98 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | - |
| Gemini Ultra (pixel only) | 80.3 | Gemini: A Family of Highly Capable Multimodal Models | - |
| DUBLIN | 36.82 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-X (Single-task FT) | 49.2 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| Pix2Struct-base | 38.2 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | - |
| PaLI-3 | 57.8 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | - |
| DUBLIN (variable resolution) | 42.6 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-X (Multi-task FT) | 50.7 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| PaLI-X (Single-task FT w/ OCR) | 54.8 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| SMoLA-PaLI-X Specialist | 66.2 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
| Pix2Struct-large | 40 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | - |
| DocFormerv2-large | 48.8 | DocFormerv2: Local Features for Document Understanding | - |
| MatCha | 37.2 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | - |
| ChatGPT 3.5 with LAPDoc Prompt (SpatialFormat) | 54.9 | LAPDoc: Layout-Aware Prompting for Documents | - |
| Claude + LATIN-Prompt | 54.51 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | - |
| TILT-Large | 61.20 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | - |
| UDOP | 47.4 | Unifying Vision, Text, and Layout for Universal Document Processing | - |
| ScreenAI 5B (4.62 B params, w/ OCR) | 65.90 | ScreenAI: A Vision-Language Model for UI and Infographics Understanding | - |
| SMoLA-PaLI-X Generalist | 65.6 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
| PaLI-3 (w/ OCR) | 62.4 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | - |