StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

Yuechen Yu; Yulin Li; Chengquan Zhang; Xiaoqiang Zhang; Zengyuan Guo; Xiameng Qin; Kun Yao; Junyu Han; Errui Ding; Jingdong Wang

Abstract

In this paper, we present StrucTexTv2, an effective document image pre-training framework that performs masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of the pre-training tasks are to reconstruct the pixels of the masked image regions and the corresponding masked tokens simultaneously. Hence, the pre-trained encoder can capture more textual semantics than conventional masked image modeling, which usually predicts masked image patches. Compared to masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 takes image-only input and can potentially handle more application scenarios because it requires no OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance on various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
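The text region-level masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_text_regions`, the `mask_ratio` default of 0.3, the fill value of 0, and the nested-list image representation are all assumptions chosen for illustration. It only shows the idea of picking a random subset of word bounding boxes and blanking the pixels inside them, which would then serve as reconstruction targets.

```python
import random

def mask_text_regions(image, word_boxes, mask_ratio=0.3, fill=0, seed=None):
    """Blank out a random subset of word boxes in a 2D grayscale image.

    image: list of rows, each a list of pixel values (hypothetical format)
    word_boxes: list of (x0, y0, x1, y1) with x1 and y1 exclusive
    mask_ratio, fill: illustrative defaults, not values taken from the paper
    Returns (masked_image, masked_indices).
    """
    rng = random.Random(seed)
    # Mask at least one box whenever any boxes are present.
    num_to_mask = max(1, round(len(word_boxes) * mask_ratio)) if word_boxes else 0
    masked_indices = sorted(rng.sample(range(len(word_boxes)), num_to_mask))

    masked = [row[:] for row in image]  # copy so the input stays intact
    height = len(masked)
    width = len(masked[0]) if masked else 0
    for i in masked_indices:
        x0, y0, x1, y1 = word_boxes[i]
        # Clip each box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill
    return masked, masked_indices

# Toy 4x8 "document" with three non-overlapping 2x2 word boxes.
image = [[255] * 8 for _ in range(4)]
boxes = [(0, 0, 2, 2), (3, 0, 5, 2), (6, 2, 8, 4)]
masked, picked = mask_text_regions(image, boxes, mask_ratio=0.3, seed=0)
```

In a pre-training setup along the lines the abstract describes, the original pixels inside the boxes listed in `picked` would be the targets for masked image modeling, and the words those boxes contain would be the targets for masked language modeling.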

Benchmarks

| Benchmark | Method | Metrics |
| --- | --- | --- |
| document-image-classification-on-rvl-cdip | StrucTexTv2 (small) | Accuracy: 93.4%; Parameters: 28M |
| document-image-classification-on-rvl-cdip | StrucTexTv2 (large) | Accuracy: 94.62%; Parameters: 238M |
| semantic-entity-labeling-on-funsd | StrucTexTv2 (large) | F1: 91.82 |
| semantic-entity-labeling-on-funsd | StrucTexTv2 (small) | F1: 89.23 |
| table-recognition-on-wtw | StrucTexTv2 (small) | F1: 78.9% |
