8 months ago

Abstract

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks.
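
To make the roles of the two tokens concrete, the following is a minimal sketch of how a reply containing several [SEG] and [REJ] tokens could be routed: each [SEG] marks a target that would be passed to a mask decoder, and each [REJ] marks a target that receives an empty mask. The reply format (`name: [TOKEN]`, comma-separated) and the helpers `parse_response` and `plan_masks` are hypothetical and chosen only for illustration; this is not GSVA's actual implementation, and no real model or decoder is called.

```python
import re
from typing import List, NamedTuple, Tuple

SEG, REJ = "[SEG]", "[REJ]"

# Assumed reply format (hypothetical): "name: [SEG], other name: [REJ]"
_PAIR = re.compile(r"([^,:]+?)\s*:\s*(\[SEG\]|\[REJ\])")


class Target(NamedTuple):
    name: str        # referring phrase for one target
    rejected: bool   # True when the model emitted [REJ] for it


def parse_response(text: str) -> List[Target]:
    """Extract one Target per 'name: [SEG]' or 'name: [REJ]' pair."""
    return [
        Target(m.group(1).strip(), m.group(2) == REJ)
        for m in _PAIR.finditer(text)
    ]


def plan_masks(targets: List[Target]) -> Tuple[List[str], List[str]]:
    """Split targets into those needing a decoded mask and those left empty."""
    to_decode = [t.name for t in targets if not t.rejected]
    empty = [t.name for t in targets if t.rejected]
    return to_decode, empty


if __name__ == "__main__":
    reply = "left dog: [SEG], red car: [REJ], person: [SEG]"
    to_decode, empty = plan_masks(parse_response(reply))
    print("decode masks for:", to_decode)
    print("empty masks for:", empty)
```

In this sketch a single expression can yield any number of masks, including zero when every target is rejected, which is the behavior GRES asks for and classic RES does not cover.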
