
InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu
Abstract

The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency, η = U/C. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
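To make the efficiency principle η = U/C concrete, the sketch below shows one hypothetical way a verifiable reward could be scored for a multi-answer grounding attempt: utility U is taken as an indicator that at least one proposed click point lands inside the target element's bounding box, and cost C is the number of candidates the model spent. The function names, the indicator-style utility, and the box-hit check are illustrative assumptions, not the paper's exact AER formulation.

```python
# Hypothetical sketch of an efficiency-style reward (eta = U / C) for
# multi-answer GUI grounding. Names and the exact scoring rule are
# assumptions for illustration, not the paper's implementation.
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def point_in_box(point: Point, box: Box) -> bool:
    """Check whether a predicted click point lands inside the target element's box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def adaptive_exploration_reward(candidates: List[Point], target: Box) -> float:
    """Toy efficiency reward: utility U (did any candidate hit the target?)
    divided by cost C (number of candidates generated)."""
    if not candidates:
        return 0.0
    utility = 1.0 if any(point_in_box(p, target) for p in candidates) else 0.0
    cost = float(len(candidates))
    return utility / cost  # eta = U / C

# Example: two proposed click points, one inside the target box -> reward 0.5
print(adaptive_exploration_reward([(10, 10), (55, 42)], (50, 40, 120, 60)))
```

Under this reading, proposing several answers raises the chance of a verifiable hit (higher U) while the cost term discourages indiscriminate guessing, which is consistent with the abstract's claim that broader but guided exploration helps the model learn difficult semantic associations.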
