Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Li Yanwei ; Zhang Yuechen ; Wang Chengyao ; Zhong Zhisheng ; Chen Yixin ; Chu Ruihang ; Liu Shaoteng ; Jia Jiaya

Abstract
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
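The abstract states that high-resolution refinement leaves the visual token count unchanged, but it does not describe the mechanism. The sketch below shows one way such a property can hold, under the assumption that the refinement is a cross-attention step: each low-resolution token acts as a query over features from the additional high-resolution encoder, and the attended result is added back to the query. The function name `refine_tokens`, the residual addition, and the toy feature values are all hypothetical illustrations, not the released implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_tokens(low_res_tokens, high_res_features):
    """Refine low-res tokens by attending over high-res features.

    Assumption (not from the abstract): refinement is scaled dot-product
    cross-attention with a residual connection. One output token is
    produced per input query, so the token count stays the same no matter
    how many high-resolution features are supplied.
    """
    dim = len(low_res_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for query in low_res_tokens:
        # Similarity between this query and every high-res feature.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in high_res_features
        ]
        weights = softmax(scores)
        # Weighted sum of the high-res features (keys double as values here).
        attended = [
            sum(w * feat[j] for w, feat in zip(weights, high_res_features))
            for j in range(dim)
        ]
        # Residual connection: keep the original token, add the detail.
        refined.append([q + a for q, a in zip(query, attended)])
    return refined


# Toy input: 2 low-res tokens, 4 high-res features, dimension 2.
low = [[1.0, 0.0], [0.0, 1.0]]
high = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out = refine_tokens(low, high)
print(len(low), len(high), len(out))
```

In this sketch the number of tokens passed on to the language model depends only on the number of queries, so a higher-resolution encoder adds detail without lengthening the LLM's input sequence.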
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| image-classification-on-coloninst-v1-seen | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 93.24 |
| image-classification-on-coloninst-v1-seen | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 92.97 |
| image-classification-on-coloninst-v1-unseen | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 78.69 |
| image-classification-on-coloninst-v1-unseen | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 78.99 |
| referring-expression-generation-on-coloninst | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 98.75 |
| referring-expression-generation-on-coloninst | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 98.17 |
| referring-expression-generation-on-coloninst-1 | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 74.30 |
| referring-expression-generation-on-coloninst-1 | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 69.81 |
| visual-question-answering-on-mm-vet | Mini-Gemini | GPT-4 score: 53.0 |
| visual-question-answering-on-mm-vet | Mini-Gemini-HD-BS | GPT-4 score: 60.8 |
| visual-question-answering-on-mm-vet | Mini-Gemini-HD | GPT-4 score: 59.3 |