Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

