HyperAIHyperAI

Command Palette

Search for a command to run...

5 months ago

CogAgent: A Visual Language Model for GUI Agents

Wenyi Hong; Weihan Wang; Qingsong Lv; Jiazheng Xu; Wenmeng Yu; Junhui Ji; Yan Wang; Zihan Wang; Yuxuan Zhang; Juanzi Li; Bin Xu; Yuxiao Dong; Ming Ding; Jie Tang

CogAgent: A Visual Language Model for GUI Agents

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120*1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art. The model and codes are available at https://github.com/THUDM/CogVLM, with a new version of CogAgent-9B-20241220 available at https://github.com/THUDM/CogAgent.

Code Repositories

digirl-agent/digirl
pytorch
Mentioned in GitHub
THUDM/CogAgent
Official
pytorch
Mentioned in GitHub
thudm/cogvlm
Official
pytorch
Mentioned in GitHub

Benchmarks

BenchmarkMethodologyMetrics
natural-language-visual-grounding-onCogAgent
Accuracy (%): 47.4
visual-question-answering-on-mm-vetCogAgent
GPT-4 score: 52.8
Params: 18B
visual-question-answering-on-mm-vet-v2CogAgent-Chat
GPT-4 score: 34.7±0.2

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp