
Abstract
Beginning with VisualGLM and CogVLM, we have been continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
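The timestamped multi-frame input that CogVLM2-Video consumes can be pictured with a short sketch. This is a minimal illustration only, assuming OpenCV decoding, even temporal sampling, and a hypothetical `num_frames` default; the repository's actual video loader and frame budget should be taken from the CogVLM2 code.

```python
import cv2  # pip install opencv-python

def sample_frames_with_timestamps(video_path: str, num_frames: int = 24):
    """Evenly sample frames from a video and pair each with its timestamp.

    Illustrates the multi-frame-plus-timestamp input idea behind
    CogVLM2-Video; the frame count and sampling policy here are
    assumptions, not the repository's actual loader.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreported
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(num_frames - 1, 1)
    indices = [int(i * (total - 1) / step) for i in range(num_frames)]

    frames, timestamps = [], []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
        timestamps.append(round(idx / fps, 2))  # seconds from video start
    cap.release()
    return frames, timestamps

frames, timestamps = sample_frames_with_timestamps("example.mp4")
print(f"{len(frames)} frames at t = {timestamps}")
```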
Code Repositories
- https://github.com/THUDM/CogVLM2
- https://github.com/THUDM/GLM-4
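As a minimal image-understanding inference sketch, assuming the `THUDM/cogvlm2-llama3-chat-19B` checkpoint id and the `build_conversation_input_ids` helper exposed by the model's `trust_remote_code` implementation (both should be verified against the CogVLM2 README, whose pattern this follows):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face checkpoint id; check the CogVLM2 README for the
# current list of released models.
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to(DEVICE)

image = Image.open("example.jpg").convert("RGB")
query = "Describe this image."

# build_conversation_input_ids comes from the model's trust_remote_code
# implementation; its exact signature may differ between releases.
inputs = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image]
)
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[inputs["images"][0].to(DEVICE).to(torch.bfloat16)]],
}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
    output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```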
Benchmarks
| Benchmark | Model | Result |
|---|---|---|
| Visual Question Answering on MM-Vet | GLM-4V-Plus | GPT-4 score: 71.1 |
| Visual Question Answering on MM-Vet | GLM-4V-9B | GPT-4 score: 58.0 |