GUICourse: From General Vision Language Models to Versatile GUI Agents
Abstract

Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.
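To make the two kinds of training data concrete, here is a minimal illustrative sketch of what an OCR/grounding sample (GUIEnv-style) and an action sample (GUIAct-style) might look like. The field names, file names, and normalized-coordinate convention below are assumptions for illustration, not the actual GUICourse schema:

```python
# Hypothetical sample layouts (schema assumed, not taken from the GUICourse release).
# A grounding sample pairs on-screen text with its bounding box; an action sample
# pairs an instruction with a concrete GUI operation.
grounding_sample = {
    "image": "screenshot_0001.png",           # rendered webpage screenshot (hypothetical file)
    "question": "Where is the text 'Sign in'?",
    "answer_bbox": [0.82, 0.05, 0.93, 0.09],  # normalized [x1, y1, x2, y2]
}

action_sample = {
    "image": "screenshot_0002.png",
    "instruction": "Open the settings page",
    "action": {"type": "click", "point": [0.50, 0.12]},  # normalized click coordinate
}

def bbox_center(bbox):
    """Center of a normalized [x1, y1, x2, y2] box, e.g. as a click target."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

print(bbox_center(grounding_sample["answer_bbox"]))
```

A grounding answer like the box above can be converted to a click point with `bbox_center`, which is one way single-step grounding supervision connects to multi-step action prediction.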

Code Repositories

yiye3/guicourse
Official
Mentioned in GitHub

Benchmarks

Benchmark: natural-language-visual-grounding-on
Methodology: Qwen-GUI
Metrics: Accuracy (%): 28.6
