5 months ago

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Xu Yiheng ; Wang Zekun ; Wang Junli ; Lu Dunjie ; Xie Tianbao ; Saha Amrita ; Sahoo Doyen ; Yu Tao ; Xiong Caiming

Abstract

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.

Code Repositories

xlang-ai/aguvis
pytorch
Mentioned in GitHub

Benchmarks

Benchmark                               Methodology    Metrics
natural-language-visual-grounding-on    Aguvis-7B      Accuracy (%): 83.0
natural-language-visual-grounding-on    Aguvis-G-7B    Accuracy (%): 81.0
