
Abstract
Beginning with VisualGLM and CogVLM, we have been continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
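The timestamped multi-frame input that CogVLM2-Video consumes can be pictured with a short sketch. This is a minimal illustration only, assuming OpenCV decoding, even temporal sampling, and a hypothetical `num_frames` default; the repository's actual video loader and frame budget should be taken from the CogVLM2 code.

```python
import cv2  # pip install opencv-python

def sample_frames_with_timestamps(video_path: str, num_frames: int = 24):
    """Evenly sample frames from a video and pair each with its timestamp.

    Illustrates the multi-frame-plus-timestamp input idea behind
    CogVLM2-Video; the frame count and sampling policy here are
    assumptions, not the repository's actual loader.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreported
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(num_frames - 1, 1)
    indices = [int(i * (total - 1) / step) for i in range(num_frames)]

    frames, timestamps = [], []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
        timestamps.append(round(idx / fps, 2))  # seconds from video start
    cap.release()
    return frames, timestamps

frames, timestamps = sample_frames_with_timestamps("example.mp4")
print(f"{len(frames)} frames at t = {timestamps}")
```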
Code Repositories
- https://github.com/THUDM/CogVLM2
- https://github.com/THUDM/GLM-4
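As a minimal image-understanding inference sketch, assuming the `THUDM/cogvlm2-llama3-chat-19B` checkpoint id and the `build_conversation_input_ids` helper exposed by the model's `trust_remote_code` implementation (both should be verified against the CogVLM2 README, whose pattern this follows):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face checkpoint id; check the CogVLM2 README for the
# current list of released models.
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to(DEVICE)

image = Image.open("example.jpg").convert("RGB")
query = "Describe this image."

# build_conversation_input_ids comes from the model's trust_remote_code
# implementation; its exact signature may differ between releases.
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[inputs["images"][0].to(DEVICE).to(torch.bfloat16)]],
}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
    output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```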
Benchmarks
| Benchmark | Model | Result |
|---|---|---|
| Visual Question Answering on MM-Vet | GLM-4V-Plus | GPT-4 score: 71.1 |
| Visual Question Answering on MM-Vet | GLM-4V-9B | GPT-4 score: 58.0 |