| Model | Score | Paper |
|---|---|---|
| Bunny-v1.0-3B (w/ LoRA, w/ extra data) | 79.50 | Efficient Multimodal Learning from Data-centric Perspective |
| LLaVA-Med-v1.5 (w/ LoRA, w/o extra data) | 79.24 | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day |
| MobileVLM-1.7B (w/o LoRA, w/ extra data) | 78.75 | MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices |
| MiniGPT-v2 (w/ LoRA, w/ extra data) | 76.82 | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning |
| LLaVA-Med-v1.0 (w/o LoRA, w/o extra data) | 78.04 | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day |
| LLaVA-v1.5 (w/ LoRA, w/o extra data) | 79.10 | Improved Baselines with Visual Instruction Tuning |
| ColonGPT (w/ LoRA, w/o extra data) | 83.24 | Frontiers in Intelligent Colonoscopy |
| MGM-2B (w/o LoRA, w/ extra data) | 78.69 | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models |
| MGM-2B (w/o LoRA, w/o extra data) | 78.99 | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models |
| LLaVA-Med-v1.0 (w/o LoRA, w/ extra data) | 77.38 | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day |
| LLaVA-v1 (w/ LoRA, w/ extra data) | 42.17 | Visual Instruction Tuning |
| Bunny-v1.0-3B (w/ LoRA, w/o extra data) | 75.50 | Efficient Multimodal Learning from Data-centric Perspective |
| LLaVA-v1 (w/ LoRA, w/o extra data) | 72.08 | Visual Instruction Tuning |
| LLaVA-v1.5 (w/ LoRA, w/ extra data) | 80.89 | Improved Baselines with Visual Instruction Tuning |
| MiniGPT-v2 (w/ LoRA, w/o extra data) | 77.93 | MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning |
| LLaVA-Med-v1.5 (w/ LoRA, w/ extra data) | 66.51 | LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day |
| MobileVLM-1.7B (w/ LoRA, w/ extra data) | 80.44 | MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices |