
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou


Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
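The grounding ability mentioned above surfaces in the model's text output: the paper describes boxes written as coordinate strings normalized to a 0-1000 grid inside <box>...</box> tags, with the referred phrase wrapped in <ref>...</ref>. The following is a minimal sketch, assuming that output convention; the parsing helper itself is our own illustration and is not part of the release.

```python
import re

# Hedged sketch: grounded phrases are assumed to appear as
# "<ref>phrase</ref><box>(x1,y1),(x2,y2)</box>", with coordinates on a
# 0-1000 grid. This helper converts such spans back to pixel boxes.
BOX_PATTERN = re.compile(
    r"<ref>(?P<phrase>.+?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounded_output(text: str, image_width: int, image_height: int):
    """Return (phrase, pixel-box) pairs parsed from a grounded response."""
    results = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        box = (
            x1 / 1000 * image_width,
            y1 / 1000 * image_height,
            x2 / 1000 * image_width,
            y2 / 1000 * image_height,
        )
        results.append((m.group("phrase"), box))
    return results

# Example with a made-up response string and a 1280x720 image.
print(parse_grounded_output(
    "<ref>the dog</ref><box>(120,340),(560,880)</box>", 1280, 720))
```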

Code Repositories

qwenlm/qwen-vl (official, PyTorch)
brandon3964/multimodal-task-vector (PyTorch, mentioned in GitHub)
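
The official repository distributes the checkpoints through Hugging Face. The sketch below follows the usage pattern documented there; the image URL and prompt are placeholders, and the from_list_format() and chat() helpers come from the model's remote code, which is why trust_remote_code=True is required.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the instruction-tuned Qwen-VL-Chat checkpoint; remote code supplies
# the vision-language helpers used below.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Build an interleaved image+text query (placeholder image URL).
query = tokenizer.from_list_format([
    {"image": "https://example.com/demo.jpeg"},
    {"text": "Describe the image, then locate the dog."},
])

# Single-turn chat; grounded phrases appear as <ref>...</ref><box>...</box>.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```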

Benchmarks

Benchmark | Methodology | Metrics
chart-question-answering-on-chartqa | Qwen-VL | 1:1 Accuracy: 65.7
chart-question-answering-on-chartqa | Qwen-VL-Chat | 1:1 Accuracy: 66.3
fs-mevqa-on-sme | Qwen-VL-Max | #Learning Samples (N): 16; ACC: 40.33; BLEU-4: 24.30; CIDEr: 201.47; Detection: 1.05; METEOR: 23.40; ROUGE-L: 34.52; SPICE: 26.13
mmr-total-on-mrr-benchmark | Qwen-VL-Max | Total Column Score: 366
mmr-total-on-mrr-benchmark | Qwen-VL-Plus | Total Column Score: 310
natural-language-visual-grounding-on | Qwen-VL | Accuracy (%): 5.2
spatial-reasoning-on-embspatial-bench | Qwen-VL-Max | Generation: 49.11
visual-question-answering-on-docvqa-test | Qwen-VL | ANLS: 0.651
visual-question-answering-on-docvqa-test | Qwen-VL-Plus | ANLS: 0.9024
visual-question-answering-on-docvqa-test | Qwen-VL-Chat | ANLS: 0.626
visual-question-answering-on-mm-vet | Qwen-VL-Max | GPT-4 score: 66.6±0.5
visual-question-answering-on-mm-vet | Qwen-VL-Plus | GPT-4 score: 61.1±0.2
visual-question-answering-on-mm-vet-v2 | Qwen-VL-Max | GPT-4 score: 55.8±0.2
visual-question-answering-on-vip-bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox): 45.3
visual-question-answering-on-vip-bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox): 39.2; GPT-4 score (human): 41.7
visual-question-answering-vqa-on-core-mm | Qwen-VL-Chat | Abductive: 44.39; Analogical: 30.42; Deductive: 37.55; Overall score: 37.39; Params: 16B
