Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Abstract
We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines. It achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
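The central architectural idea, multi-modal self-attention in which text and visual streams share the same spatial embeddings, can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the paper's exact formulation: the class name, the dimensions, and the simple per-key spatial bias are assumptions; DocFormer itself uses richer relative-position attention terms.

```python
import torch
import torch.nn as nn

class MultiModalSelfAttention(nn.Module):
    """Illustrative sketch (hypothetical): text and visual tokens attend
    within their own streams, while one spatial bias derived from shared
    2D box embeddings is added to both attention maps, so tokens at the
    same page location are biased the same way in either modality."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Separate query/key/value projections per modality.
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)
        self.q_vis = nn.Linear(d_model, d_model)
        self.k_vis = nn.Linear(d_model, d_model)
        self.v_vis = nn.Linear(d_model, d_model)
        # One spatial projection shared by both modalities.
        self.spatial = nn.Linear(d_model, n_heads)

    def _attend(self, q, k, v, spatial_bias):
        b, n, _ = q.shape
        q = q.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        scores = scores + spatial_bias  # same bias for text and vision
        attn = scores.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, n, -1)

    def forward(self, text_feats, vis_feats, spatial_emb):
        # spatial_emb: (batch, seq, d_model), built from token bounding
        # boxes; a simple per-key bias stands in for the paper's
        # relative spatial attention terms.
        bias = self.spatial(spatial_emb)            # (b, n, heads)
        bias = bias.permute(0, 2, 1).unsqueeze(-2)  # (b, heads, 1, n)
        t = self._attend(self.q_text(text_feats), self.k_text(text_feats),
                         self.v_text(text_feats), bias)
        v = self._attend(self.q_vis(vis_feats), self.k_vis(vis_feats),
                         self.v_vis(vis_feats), bias)
        return t + v  # fused multi-modal representation
```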
Benchmarks
| Benchmark | Method | Accuracy | Parameters |
|---|---|---|---|
| Document Image Classification on RVL-CDIP | DocFormer (base) | 96.17% | 183M |
| Document Image Classification on RVL-CDIP | DocFormer (large) | 95.50% | 536M |