MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos

Abstract
Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are the key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures, and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
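The chart-derendering pretraining task maps a rendered plot back to its underlying data. As a rough illustration of what such a training target could look like, the sketch below linearizes a small data table into a single string that an image-to-text model like Pix2Struct could be trained to emit; the serialization format (separators, `title:` prefix) is an assumption for illustration, not the paper's exact scheme.

```python
# Sketch of a chart-derendering training target: the underlying table of a
# plot, flattened into one string. The format here is illustrative only.

def linearize_table(title, columns, rows):
    """Flatten a small data table into a single target string."""
    header = " | ".join(columns)
    body = " ; ".join(" | ".join(str(v) for v in row) for row in rows)
    return f"title: {title} <0x0A> {header} <0x0A> {body}"

# A (chart image, target string) pair would use this string as the label.
target = linearize_table(
    "Monthly sales",
    ["month", "units"],
    [["Jan", 120], ["Feb", 95]],
)
print(target)
```

During pretraining, the model sees the rendered chart as input and is supervised to generate this linearized table, forcing it to recover axes, legends, and values from pixels alone.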
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| chart-question-answering-on-chartqa | MatCha | 1:1 Accuracy: 64.2 |
| chart-question-answering-on-plotqa | MatCha | 1:1 Accuracy: 91.5 |
| chart-question-answering-on-realcqa | MatCha (ChartQA) | 1:1 Accuracy: 0.2597 |
| visual-question-answering-on-docvqa-test | MatCha | ANLS: 0.742 |
| visual-question-answering-on-plotqa-d1-1 | MatCha | 1:1 Accuracy: 92.3 |
| visual-question-answering-on-plotqa-d2-1 | MatCha | 1:1 Accuracy: 90.7 |
| visual-question-answering-vqa-on | MatCha | ANLS: 37.2 |
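Chart-QA benchmarks such as ChartQA commonly score answers with "relaxed accuracy": string answers must match exactly, while numeric answers may deviate within a small relative tolerance (typically 5%). The sketch below shows that metric under those standard assumptions; the exact evaluation scripts used for the table above may differ in details.

```python
# Sketch of ChartQA-style "relaxed accuracy" matching, assuming the
# commonly used 5% relative tolerance for numeric answers.

def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; other answers
    require a case-insensitive exact string match."""
    try:
        p, t = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if t == 0:
        return p == 0
    return abs(p - t) / abs(t) <= tolerance

print(relaxed_match("104", "100"))   # within 5% of the gold value
print(relaxed_match("110", "100"))   # 10% off, counted as wrong
```

Benchmark accuracy is then the fraction of question-answer pairs for which `relaxed_match` returns `True`.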