HyperAI
Visual Question Answering On Docvqa Test
Metric: ANLS (Average Normalized Levenshtein Similarity)

Results: performance of various models on this benchmark.

| Model Name | ANLS | Paper Title | Repository |
|---|---|---|---|
| MatCha | 0.742 | MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering | - |
| GPT-4 | 0.884 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | - |
| PaLI-3 | 0.876 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | - |
| Qwen-VL | 0.651 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| ERNIE-Layout large | 0.8486 | ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding | - |
| DUBLIN | 0.782 | DUBLIN -- Document Understanding By Language-Image Network | - |
| Pix2Struct-base | 0.721 | Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | - |
| DUBLIN (variable resolution) | 0.803 | DUBLIN -- Document Understanding By Language-Image Network | - |
| PaLI-3 (w/ OCR) | 0.886 | PaLI-3 Vision Language Models: Smaller, Faster, Stronger | - |
| Qwen-VL-Plus | 0.9024 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | - |
| PaLI-X (Single-task FT w/ OCR) | 0.868 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| PaLI-X (Single-task FT) | 0.80 | PaLI-X: On Scaling up a Multilingual Vision and Language Model | - |
| Claude + LATIN-Prompt | 0.8336 | Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering | - |
| TILT-Large | 0.8705 | Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer | - |
| BERT_LARGE_SQUAD_DOCVQA_FINETUNED_Baseline | 0.665 | DocVQA: A Dataset for VQA on Document Images | - |
| DocFormerv2-large | 0.8784 | DocFormerv2: Local Features for Document Understanding | - |
| MLCD-Embodied-7B | 0.916 | Multi-label Cluster Discrimination for Visual Representation Learning | - |
| UDOP (aux) | 0.878 | Unifying Vision, Text, and Layout for Universal Document Processing | - |
| UDOP | 0.847 | Unifying Vision, Text, and Layout for Universal Document Processing | - |
| SMoLA-PaLI-X Generalist | 0.906 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | - |
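For reference, ANLS scores each question by the best soft string match between the predicted answer and any gold answer (1 minus the normalized edit distance, zeroed below a similarity threshold), then averages over questions. A minimal sketch — function names are illustrative, and the 0.5 threshold follows the common DocVQA convention rather than anything stated on this page:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions.

    For each question: take the max over gold answers of (1 - NL), where
    NL is edit distance divided by the longer string's length; scores with
    NL >= tau are set to 0. Matching is case-insensitive, as is customary.
    """
    total = 0.0
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            if not p and not g:
                best = 1.0
                continue
            nl = levenshtein(p, g) / max(len(p), len(g))
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)

# Example: one exact match (score 1.0) and one near miss ("2019" vs "2018":
# edit distance 1 over length 4, so score 0.75) average to 0.875.
print(anls(["budget", "2019"], [["budget"], ["2018"]]))  # → 0.875
```

Leaderboard entries report this quantity on the DocVQA test split, so a perfect system scores 1.0 and answers that differ from every gold answer by half their length or more contribute 0.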