5 months ago

OmniParser for Pure Vision Based GUI Agent

Yadong Lu Jianwei Yang Yelong Shen Ahmed Awadallah

Abstract

The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset using popular webpages and an icon description dataset. These datasets were utilized to fine-tune specialized models: a detection model to parse interactable regions on the screen and a caption model to extract the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms the GPT-4V baselines that require additional information beyond the screenshot.
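
To make the idea of "structured elements" concrete, here is a minimal Python sketch of how parsed screen elements might be represented, listed for a vision language model, and mapped back to a screen region. It is an illustration under stated assumptions, not OmniParser's actual output format or API: the `ScreenElement` class, its fields, and the helpers `format_elements_for_prompt` and `click_point` are all hypothetical names, and the sample elements are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenElement:
    """One parsed UI element (hypothetical structure, not OmniParser's real format)."""
    element_id: int
    bbox: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    description: str    # functional caption, as a caption model might produce
    interactable: bool  # flag, as a detection model might produce


def format_elements_for_prompt(elements):
    """Render the interactable elements as a numbered text list for a model prompt."""
    lines = []
    for el in elements:
        if el.interactable:
            lines.append(f"[{el.element_id}] {el.description}")
    return "\n".join(lines)


def click_point(elements, element_id):
    """Map an element ID chosen by the model back to the center of its bounding box."""
    for el in elements:
        if el.element_id == element_id:
            x_min, y_min, x_max, y_max = el.bbox
            return ((x_min + x_max) / 2, (y_min + y_max) / 2)
    raise KeyError(f"no element with id {element_id}")


# Made-up example screen with three elements.
elements = [
    ScreenElement(0, (10, 10, 50, 40), "Back button", True),
    ScreenElement(1, (60, 10, 400, 40), "Page title text", False),
    ScreenElement(2, (420, 10, 460, 40), "Settings icon", True),
]

prompt_text = format_elements_for_prompt(elements)
target = click_point(elements, 2)
print(prompt_text)
print(target)
```

In this sketch the model only has to name an element ID; the bounding box supplies the coordinates, which is one simple way to ground an intended action in a screen region.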

Code Repositories

microsoft/omniparser
jax
Mentioned in GitHub

Benchmarks

Benchmark | Methodology | Metrics
natural-language-visual-grounding-on | OmniParser | Accuracy (%): 73.0
