
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou


Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
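The grounding ability mentioned above surfaces in the model's text output: the paper describes boxes written as coordinate strings normalized to a 0-1000 grid inside <box>...</box> tags, with the referred phrase wrapped in <ref>...</ref>. The following is a minimal sketch, assuming that output convention; the parsing helper itself is our own illustration and is not part of the release.

```python
import re

# Hedged sketch: grounded phrases are assumed to appear as
# "<ref>phrase</ref><box>(x1,y1),(x2,y2)</box>", with coordinates on a
# 0-1000 grid. This helper converts such spans back to pixel boxes.
BOX_PATTERN = re.compile(
    r"<ref>(?P<phrase>.+?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounded_output(text: str, image_width: int, image_height: int):
    """Return (phrase, pixel-box) pairs parsed from a grounded response."""
    results = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        box = (
            x1 / 1000 * image_width,
            y1 / 1000 * image_height,
            x2 / 1000 * image_width,
            y2 / 1000 * image_height,
        )
        results.append((m.group("phrase"), box))
    return results

# Example with a made-up response string and a 1280x720 image.
print(parse_grounded_output(
    "<ref>the dog</ref><box>(120,340),(560,880)</box>", 1280, 720))
```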

Code Repositories

qwenlm/qwen-vl (official, PyTorch)
brandon3964/multimodal-task-vector (PyTorch, mentioned in GitHub)
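
The official repository distributes the checkpoints through Hugging Face. The sketch below follows the usage pattern documented there; the image URL and prompt are placeholders, and the from_list_format() and chat() helpers come from the model's remote code, which is why trust_remote_code=True is required.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the instruction-tuned Qwen-VL-Chat checkpoint; remote code supplies
# the vision-language helpers used below.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build an interleaved image+text query (placeholder image URL).
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},
    {"text": "Describe the image, then locate the dog."},
])

# Single-turn chat; grounded phrases appear as <ref>...</ref><box>...</box>.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```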

Benchmarks

Benchmark | Methodology | Metrics
chart-question-answering-on-chartqa | Qwen-VL | 1:1 Accuracy: 65.7
chart-question-answering-on-chartqa | Qwen-VL-Chat | 1:1 Accuracy: 66.3
fs-mevqa-on-sme | Qwen-VL-Max | #Learning Samples (N): 16; ACC: 40.33; BLEU-4: 24.30; CIDEr: 201.47; Detection: 1.05; METEOR: 23.40; ROUGE-L: 34.52; SPICE: 26.13
mmr-total-on-mrr-benchmark | Qwen-VL-Max | Total Column Score: 366
mmr-total-on-mrr-benchmark | Qwen-VL-Plus | Total Column Score: 310
natural-language-visual-grounding-on | Qwen-VL | Accuracy (%): 5.2
spatial-reasoning-on-embspatial-bench | Qwen-VL-Max | Generation: 49.11
visual-question-answering-on-docvqa-test | Qwen-VL | ANLS: 0.651
visual-question-answering-on-docvqa-test | Qwen-VL-Plus | ANLS: 0.9024
visual-question-answering-on-docvqa-test | Qwen-VL-Chat | ANLS: 0.626
visual-question-answering-on-mm-vet | Qwen-VL-Max | GPT-4 score: 66.6±0.5
visual-question-answering-on-mm-vet | Qwen-VL-Plus | GPT-4 score: 61.1±0.2
visual-question-answering-on-mm-vet-v2 | Qwen-VL-Max | GPT-4 score: 55.8±0.2
visual-question-answering-on-vip-bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox): 45.3
visual-question-answering-on-vip-bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox): 39.2; GPT-4 score (human): 41.7
visual-question-answering-vqa-on-core-mm | Qwen-VL-Chat | Abductive: 44.39; Analogical: 30.42; Deductive: 37.55; Overall score: 37.39; Params: 16B
