Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Abstract
We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines. It achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).
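The central architectural idea, multi-modal self-attention in which text and visual streams share the same spatial embeddings, can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the paper's exact formulation: the class name, the dimensions, and the simple per-key spatial bias are assumptions; DocFormer itself uses richer relative-position attention terms.

```python
import torch
import torch.nn as nn

class MultiModalSelfAttention(nn.Module):
    """Illustrative sketch (hypothetical): text and visual tokens attend
    within their own streams, while one spatial bias derived from shared
    2D box embeddings is added to both attention maps, so tokens at the
    same page location are biased the same way in either modality."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Separate query/key/value projections per modality.
        self.q_text = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_model, d_model)
        self.v_text = nn.Linear(d_model, d_model)
        self.q_vis = nn.Linear(d_model, d_model)
        self.k_vis = nn.Linear(d_model, d_model)
        self.v_vis = nn.Linear(d_model, d_model)
        # One spatial projection shared by both modalities.
        self.spatial = nn.Linear(d_model, n_heads)

    def _attend(self, q, k, v, spatial_bias):
        b, n, _ = q.shape
        q = q.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        scores = scores + spatial_bias  # same bias for text and vision
        attn = scores.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, n, -1)

    def forward(self, text_feats, vis_feats, spatial_emb):
        # spatial_emb: (batch, seq, d_model), built from token bounding
        # boxes; a simple per-key bias stands in for the paper's
        # relative spatial attention terms.
        bias = self.spatial(spatial_emb)            # (b, n, heads)
        bias = bias.permute(0, 2, 1).unsqueeze(-2)  # (b, heads, 1, n)
        t = self._attend(self.q_text(text_feats), self.k_text(text_feats),
                         self.v_text(text_feats), bias)
        v = self._attend(self.q_vis(vis_feats), self.k_vis(vis_feats),
                         self.v_vis(vis_feats), bias)
        return t + v  # fused multi-modal representation
```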
Benchmarks
| Benchmark | Method | Accuracy | Parameters |
|---|---|---|---|
| Document Image Classification on RVL-CDIP | DocFormer (base) | 96.17% | 183M |
| Document Image Classification on RVL-CDIP | DocFormer (large) | 95.50% | 536M |