HyperAI
Visual Question Answering On Docvqa Test
Metric: ANLS (Average Normalized Levenshtein Similarity)

Results: performance of various models on this benchmark.

| Model Name | ANLS | Paper Title | Repository |
|---|---|---|---|
| MatCha | 0.742 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | - |
| GPT-4 | 0.884 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | - |
| PaLI-3 | 0.876 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | - |
| Qwen-VL | 0.651 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| ERNIE-Layout large | 0.8486 | ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | - |
| DUBLIN | 0.782 | DUBLIN -- Document Understanding By Language-Image Network | - |
| Pix2Struct-base | 0.721 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | - |
| DUBLIN (variable resolution) | 0.803 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-3 (w/ OCR) | 0.886 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | - |
| Qwen-VL-Plus | 0.9024 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| PaLI-X (Single-task FT w/ OCR) | 0.868 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| PaLI-X (Single-task FT) | 0.80 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| Claude + LATIN-Prompt | 0.8336 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | - |
| TILT-Large | 0.8705 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | - |
| BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 0.665 | DocVQA: A Dataset for VQA on Document Images | - |
| DocFormerv2-large | 0.8784 | DocFormerv2: Local Features for Document Understanding | - |
| MLCD-Embodied-7B | 0.916 | Multi-label Cluster Discrimination for Visual Representation Learning | - |
| UDOP (aux) | 0.878 | Unifying Vision, Text, and Layout for Universal Document Processing | - |
| UDOP | 0.847 | Unifying Vision, Text, and Layout for Universal Document Processing | - |
| SMoLA-PaLI-X Generalist | 0.906 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
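For reference, ANLS scores each question by the best soft string match between the predicted answer and any gold answer (1 minus the normalized edit distance, zeroed below a similarity threshold), then averages over questions. A minimal sketch — function names are illustrative, and the 0.5 threshold follows the common DocVQA convention rather than anything stated on this page:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    For each question: take the max over gold answers of (1 - NL), where
    NL is edit distance divided by the longer string's length; scores with
    NL >= tau are set to 0. Matching is case-insensitive, as is customary.
    """
    total = 0.0
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            if not p and not g:
                best = 1.0
                continue
            nl = levenshtein(p, g) / max(len(p), len(g))
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)

# Example: one exact match (score 1.0) and one near miss ("2019" vs "2018":
# edit distance 1 over length 4, so score 0.75) average to 0.875.
print(anls(["budget", "2019"], [["budget"], ["2018"]]))  # → 0.875
```

Leaderboard entries report this quantity on the DocVQA test split, so a perfect system scores 1.0 and answers that differ from every gold answer by half their length or more contribute 0.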