GSVA: Generalized Segmentation via Multimodal Large Language Models

Xia Zhuofan ; Han Dongchen ; Han Yizeng ; Pan Xuran ; Song Shiji ; Huang Gao

Abstract

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks.
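The token routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `route_targets`, the `(target, token)` pair format, and the toy decoder are hypothetical names introduced here for illustration only. In the real model the mask decoder would be prompted by the embedding of each [SEG] token; here the stand-in decoder simply receives the target phrase.

```python
from typing import Callable, Dict, List, Optional, Tuple

SEG, REJ = "[SEG]", "[REJ]"
Mask = List[List[int]]  # toy binary mask as nested lists


def route_targets(
    pairs: List[Tuple[str, str]],
    decode_mask: Callable[[str], Mask],
) -> Dict[str, Optional[Mask]]:
    """Map each referred target to a mask, or to None if it was rejected.

    `pairs` is a hypothetical parsed model output: one (target phrase, token)
    entry per referred target, where the token is either [SEG] or [REJ].
    """
    results: Dict[str, Optional[Mask]] = {}
    for target, token in pairs:
        if token == SEG:
            # One [SEG] per target, so several masks can come from one prompt.
            results[target] = decode_mask(target)
        elif token == REJ:
            # The target is absent from the image: no mask is produced.
            results[target] = None
        else:
            raise ValueError(f"unexpected token: {token!r}")
    return results


def toy_decoder(target: str) -> Mask:
    """Stand-in for a mask decoder such as SAM (assumption: fixed 2x2 mask)."""
    return [[1, 0], [0, 1]]


out = route_targets([("the cat", SEG), ("the unicorn", REJ)], toy_decoder)
```

Returning `None` for a rejected target keeps "no such object" distinct from "an empty mask was predicted", which is the distinction the [REJ] token is meant to make explicit.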

Code Repositories

leaplabthu/gsva (Official, PyTorch; mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
generalized-referring-expression-segmentation | GSVA-Vicuna-13B-v1.1 | cIoU: 64.05, gIoU: 68.01
generalized-referring-expression-segmentation | GSVA-Vicuna-7B-v1.1 | cIoU: 63.29, gIoU: 66.47
generalized-referring-expression-segmentation | GSVA-Llama2-13B | cIoU: 66.38, gIoU: 70.04
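
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed on GRES-style data. It rests on assumptions not stated on this page: cIoU is taken as total intersection over total union across all samples, gIoU as the mean of per-sample IoU, and a no-target sample scores 1 when the prediction is also empty and 0 otherwise. The function name `giou_ciou` and the set-of-pixels mask representation are hypothetical; the official gRefCOCO evaluation code should be consulted for the exact protocol behind the scores above.

```python
from typing import List, Set, Tuple

Pixels = Set[Tuple[int, int]]  # foreground pixel coordinates; empty = no target


def giou_ciou(samples: List[Tuple[Pixels, Pixels]]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) for a list of (prediction, ground truth) pairs."""
    per_sample: List[float] = []
    total_inter = 0
    total_union = 0
    for pred, gt in samples:
        inter = len(pred & gt)
        union = len(pred | gt)
        total_inter += inter
        total_union += union
        if not gt:
            # No-target sample: full credit only if nothing was predicted.
            per_sample.append(1.0 if not pred else 0.0)
        else:
            per_sample.append(inter / union)
    giou = sum(per_sample) / len(per_sample)
    ciou = total_inter / total_union if total_union else 0.0
    return giou, ciou


samples = [
    ({(0, 0), (0, 1)}, {(0, 0), (0, 1), (1, 1)}),  # partial overlap
    (set(), set()),                                # correctly rejected target
    ({(2, 2)}, set()),                             # mask predicted for a null target
]
giou, ciou = giou_ciou(samples)
```

Under these assumptions, cIoU weights large objects more heavily because it pools pixels across samples, while gIoU treats every sample equally and directly rewards correct rejection of null targets.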
