
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
Abstract

This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
