Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong

Abstract
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
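To make the abstract's two core ideas concrete, here is a minimal, hypothetical sketch (not the paper's actual API) of a unified, platform-agnostic action space using normalized screen coordinates, paired with an inner-monologue step that attaches explicit reasoning to each action. All names (`Click`, `Step`, `to_pixels`) are illustrative assumptions.

```python
# Illustrative sketch only: a unified cross-platform action space
# with normalized coordinates, plus an inner-monologue step.
from dataclasses import dataclass

@dataclass
class Click:
    x: float  # normalized [0, 1] horizontal position on the screenshot
    y: float  # normalized [0, 1] vertical position on the screenshot

@dataclass
class Step:
    thought: str   # structured reasoning (inner monologue)
    action: Click  # the unified action chosen after reasoning

def to_pixels(action: Click, width: int, height: int) -> tuple[int, int]:
    """Map normalized coordinates to a concrete resolution, so one
    action representation transfers across desktop, web, and mobile."""
    return round(action.x * width), round(action.y * height)

# Example trajectory fragment: reason first, then act on a 1920x1080 screen.
step = Step(thought="The search field is near the top center; click it.",
            action=Click(x=0.5, y=0.08))
print(to_pixels(step.action, 1920, 1080))  # (960, 86)
```

Normalizing coordinates is one simple way to keep a single action vocabulary valid across differently sized screens, which is the kind of cross-platform standardization the abstract describes.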
Benchmarks
| Benchmark | Model | Accuracy (%) |
|---|---|---|
| natural-language-visual-grounding-on | Aguvis-7B | 83.0 |
| natural-language-visual-grounding-on | Aguvis-G-7B | 81.0 |