DocFormer: End-to-End Transformer for Document Understanding

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, R. Manmatha

Abstract

We present DocFormer, a multi-modal transformer-based architecture for Visual Document Understanding (VDU). VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks that encourage multi-modal interaction. It uses text, vision, and spatial features and combines them with a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text with visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size in number of parameters.
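To make the core idea concrete, here is a minimal NumPy sketch of a single multi-modal self-attention head in the spirit the abstract describes: text and vision features each attend over the sequence, while one shared spatial bias (derived from the same learned spatial embeddings) shapes both attention maps. This is a hypothetical simplification for illustration, not the paper's exact layer; all function and variable names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_self_attention(text, vision, spatial):
    """Sketch of one multi-modal self-attention head.

    text, vision, spatial: (seq_len, d) feature arrays for the
    same token positions. The single spatial bias term is added
    to BOTH attention maps, mimicking DocFormer's sharing of
    spatial embeddings across modalities (our simplification:
    no learned projections, single head).
    """
    d = text.shape[-1]
    scale = np.sqrt(d)
    # One spatial attention bias, shared by both modalities.
    spatial_bias = spatial @ spatial.T / scale
    # Modality-specific content attention plus the shared bias.
    attn_text = softmax(text @ text.T / scale + spatial_bias)
    attn_vision = softmax(vision @ vision.T / scale + spatial_bias)
    # Fuse by summing the attended features from each modality.
    return attn_text @ text + attn_vision @ vision
```

Because the spatial bias enters both branches, a token whose bounding box is near a visual region gets a boosted attention score in the text and vision maps alike, which is one way to read the abstract's claim that shared spatial embeddings help the model correlate text with visual tokens.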

Code Repositories

shabie/docformer (PyTorch)

Benchmarks

Document Image Classification on RVL-CDIP
  DocFormer-base:  Accuracy 96.17%, Parameters 183M
  DocFormer-large: Accuracy 95.50%, Parameters 536M
