Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

Abstract

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline, assisted by GPT-4o, that extracts instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
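
To make the abstract's central idea concrete, the sketch below shows one way "explicit visual prompting" can be realized: drawing numbered markers on detected instances so a model can be asked about a specific instance by its tag. This is a minimal, hypothetical illustration; the function name, bounding boxes, and file paths are assumptions for demonstration and do not reproduce the authors' actual annotation pipeline.

```python
# Hypothetical sketch: overlay explicit visual prompts (numbered instance tags)
# on a frame before sending it to a GPT-4o-style annotator or an LMM for
# instance-level QA. Boxes and paths below are illustrative, not from the paper.
from PIL import Image, ImageDraw, ImageFont

def add_instance_prompts(image_path, boxes, out_path="prompted.jpg"):
    """Draw a red box and a numeric tag [i] on each instance so questions can
    refer to instances explicitly, e.g. "What is instance [2] doing?"."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), f"[{idx}]", fill="red", font=font)
    img.save(out_path)
    return out_path

# Example usage with two hypothetical instances in one frame
prompted = add_instance_prompts(
    "frame_000.jpg",
    [(30, 40, 120, 200), (150, 60, 260, 220)],
)
```

The visually prompted image can then be paired with instance-referring questions and answers, which is the kind of instance-level supervision the Inst-IT dataset and benchmark are built around.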

Code Repositories

inst-it/inst-it (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
Visual Question Answering on ViP-Bench | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | GPT-4 score (bbox): 50.5; GPT-4 score (human): 49.0
Visual Question Answering on ViP-Bench | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | GPT-4 score (bbox): 45.1; GPT-4 score (human): 48.2
