Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset, containing 4 million examples produced through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving success rates from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.