Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

Abstract
Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires a more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline assisted by GPT-4o to extract instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
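The abstract does not spell out the pipeline's implementation, but the core idea of explicit visual prompting is easy to sketch. The snippet below is a minimal, hypothetical illustration (not the authors' released code): it overlays Set-of-Marks-style numeric IDs on given instance boxes and asks GPT-4o to describe each marked instance. It assumes Pillow and the official OpenAI Python SDK with an `OPENAI_API_KEY` in the environment; the helper names, marker styling, and prompt wording are all invented for illustration.

```python
import base64
import io

from openai import OpenAI  # official OpenAI Python SDK (assumed available)
from PIL import Image, ImageDraw


def overlay_instance_marks(image: Image.Image,
                           boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered mark at the center of each instance box
    (a Set-of-Marks-style explicit visual prompt)."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12], fill="red")
        draw.text((cx - 4, cy - 7), str(idx), fill="white")
    return marked


def annotate_instances(image: Image.Image,
                       boxes: list[tuple[int, int, int, int]]) -> str:
    """Query GPT-4o for per-instance descriptions of the marked image."""
    marked = overlay_instance_marks(image, boxes)
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Each numbered red mark tags one instance. "
                         "Describe every numbered instance individually."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Putting the instance IDs directly in pixel space, rather than describing box coordinates in text, is what the abstract refers to as an explicit visual cue: the model can then be asked about "instance 2" and ground its answer to the marked region.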
Benchmarks
| Benchmark | Model | GPT-4 score (bbox) | GPT-4 score (human) |
|---|---|---|---|
| Visual Question Answering on ViP-Bench | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | 49.0 |
| Visual Question Answering on ViP-Bench | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | 48.2 |