Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

Abstract
Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires a more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline assisted by GPT-4o to extract instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we propose Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
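The abstract does not spell out the pipeline's implementation, but the core idea of explicit visual prompting is easy to sketch. The snippet below is a minimal, hypothetical illustration (not the authors' released code): it overlays Set-of-Marks-style numeric IDs on given instance boxes and asks GPT-4o to describe each marked instance. It assumes Pillow and the official OpenAI Python SDK with an `OPENAI_API_KEY` in the environment; the helper names, marker styling, and prompt wording are all invented for illustration.

```python
import base64
import io

from openai import OpenAI  # official OpenAI Python SDK (assumed available)
from PIL import Image, ImageDraw


def overlay_instance_marks(image: Image.Image,
                           boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered mark at the center of each instance box
    (a Set-of-Marks-style explicit visual prompt)."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12], fill="red")
        draw.text((cx - 4, cy - 7), str(idx), fill="white")
    return marked


def annotate_instances(image: Image.Image,
                       boxes: list[tuple[int, int, int, int]]) -> str:
    """Query GPT-4o for per-instance descriptions of the marked image."""
    marked = overlay_instance_marks(image, boxes)
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Each numbered red mark tags one instance. "
                         "Describe every numbered instance individually."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Putting the instance IDs directly in pixel space, rather than describing box coordinates in text, is what the abstract refers to as an explicit visual cue: the model can then be asked about "instance 2" and ground its answer to the marked region.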
Benchmarks
| Benchmark | Model | GPT-4 score (bbox) | GPT-4 score (human) |
|---|---|---|---|
| Visual Question Answering on ViP-Bench | LLaVA-NeXT-Inst-IT-Qwen2-7B (Visual Prompt) | 50.5 | 49.0 |
| Visual Question Answering on ViP-Bench | LLaVA-NeXT-Inst-IT-Vicuna-7B (Visual Prompt) | 45.1 | 48.2 |