5 months ago

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Li Yanwei ; Zhang Yuechen ; Wang Chengyao ; Zhong Zhisheng ; Chen Yixin ; Chu Ruihang ; Liu Shaoteng ; Jia Jiaya

Abstract

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
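The abstract does not spell out how the additional encoder refines the visual tokens. One plausible reading, sketched below purely as an assumption, is cross-attention in which the low-resolution tokens act as queries and the high-resolution features act as keys and values, so the output has exactly as many tokens as the low-resolution input. The function name `refine_tokens`, the residual connection, the absence of learned projections, and all shapes are hypothetical choices for illustration and are not taken from the Mini-Gemini code.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_tokens(low_res_tokens, high_res_feats):
    """Hypothetical refinement step (an assumption, not the authors' code).

    low_res_tokens: (N, C) array, used as attention queries.
    high_res_feats: (M, C) array, used as attention keys and values.
    Returns the refined (N, C) tokens and the (N, M) attention weights.
    """
    _, c = low_res_tokens.shape
    scores = low_res_tokens @ high_res_feats.T / np.sqrt(c)  # (N, M)
    attn = softmax(scores, axis=-1)                          # each row sums to 1
    refined = low_res_tokens + attn @ high_res_feats         # still (N, C)
    return refined, attn


rng = np.random.default_rng(0)
low = rng.standard_normal((16, 32))   # 16 low-resolution tokens
high = rng.standard_normal((64, 32))  # 64 high-resolution features
refined, attn = refine_tokens(low, high)
print(refined.shape)  # (16, 32)
```

In this sketch the language model would still receive 16 tokens however many high-resolution features are supplied, which is the property the abstract claims; a real system would presumably add learned projections and restrict each query to a local region.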

Code Repositories

dvlab-research/MGM (PyTorch, mentioned in GitHub)
dvlab-research/minigemini (official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
image-classification-on-coloninst-v1-seen | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 93.24
image-classification-on-coloninst-v1-seen | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 92.97
image-classification-on-coloninst-v1-unseen | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 78.69
image-classification-on-coloninst-v1-unseen | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 78.99
referring-expression-generation-on-coloninst | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 98.75
referring-expression-generation-on-coloninst | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 98.17
referring-expression-generation-on-coloninst-1 | MGM-2B (w/o LoRA, w/ extra data) | Accuracy: 74.30
referring-expression-generation-on-coloninst-1 | MGM-2B (w/o LoRA, w/o extra data) | Accuracy: 69.81
visual-question-answering-on-mm-vet | Mini-Gemini | GPT-4 score: 53.0
visual-question-answering-on-mm-vet | Mini-Gemini-HD-BS | GPT-4 score: 60.8
visual-question-answering-on-mm-vet | Mini-Gemini-HD | GPT-4 score: 59.3
