Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong

Abstract
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
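To make the abstract's two core ideas concrete, here is a minimal, hypothetical sketch (not the paper's actual API) of a unified, platform-agnostic action space using normalized screen coordinates, paired with an inner-monologue step that attaches explicit reasoning to each action. All names (`Click`, `Step`, `to_pixels`) are illustrative assumptions.

```python
# Illustrative sketch only: a unified cross-platform action space
# with normalized coordinates, plus an inner-monologue step.
from dataclasses import dataclass

@dataclass
class Click:
    x: float  # normalized [0, 1] horizontal position on the screenshot
    y: float  # normalized [0, 1] vertical position on the screenshot

@dataclass
class Step:
    thought: str   # structured reasoning (inner monologue)
    action: Click  # the unified action chosen after reasoning

def to_pixels(action: Click, width: int, height: int) -> tuple[int, int]:
    """Map normalized coordinates to a concrete resolution, so one
    action representation transfers across desktop, web, and mobile."""
    return round(action.x * width), round(action.y * height)

# Example trajectory fragment: reason first, then act on a 1920x1080 screen.
step = Step(thought="The search field is near the top center; click it.",
            action=Click(x=0.5, y=0.08))
print(to_pixels(step.action, 1920, 1080))  # (960, 86)
```

Normalizing coordinates is one simple way to keep a single action vocabulary valid across differently sized screens, which is the kind of cross-platform standardization the abstract describes.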
Benchmarks
| Benchmark | Model | Accuracy (%) |
|---|---|---|
| natural-language-visual-grounding-on | Aguvis-7B | 83.0 |
| natural-language-visual-grounding-on | Aguvis-G-7B | 81.0 |