Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer-use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release Jedi, the largest computer-use grounding dataset, containing 4 million examples produced through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving success rates from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.