CogVLM2: Visual Language Models for Image and Video Understanding

Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
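The abstract describes CogVLM2-Video as taking multi-frame input paired with timestamps. Purely as an illustration (this is not the authors' preprocessing code), a minimal sketch of uniform frame sampling with per-frame timestamps could look like the following; the function name, the 8-frame default, and the rounding choices are assumptions:

```python
def sample_frames_with_timestamps(total_frames, fps, num_frames=8):
    """Uniformly sample frame indices from a video and compute each
    sampled frame's timestamp in seconds (rounded to 2 decimals)."""
    if num_frames == 1:
        indices = [0]
    else:
        # Spread num_frames indices evenly across [0, total_frames - 1].
        step = (total_frames - 1) / (num_frames - 1)
        indices = [round(i * step) for i in range(num_frames)]
    # Timestamp of a frame index is index / frames-per-second.
    timestamps = [round(idx / fps, 2) for idx in indices]
    return list(zip(indices, timestamps))

# Example: a 10-second clip with 300 frames at 30 fps, sampling 4 frames.
pairs = sample_frames_with_timestamps(total_frames=300, fps=30, num_frames=4)
```

In a pipeline like the one the abstract sketches, each sampled frame image would then be fed to the model alongside its timestamp string, letting the model ground answers in video time.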

Code Repositories

thudm/cogvlm2 (official, PyTorch)
thudm/glm-4 (official, PyTorch)

Benchmarks

Benchmark                            | Methodology | Metric
Visual Question Answering on MM-Vet  | GLM-4V-Plus | GPT-4 score: 71.1
Visual Question Answering on MM-Vet  | GLM-4V-9B   | GPT-4 score: 58.0
