Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Dai Ming ; Li Jian ; Zhuang Jiedong ; Zhang Xian ; Yang Wankou

Abstract
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. Most advanced methods focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) makes these methods error-prone, leading to inconsistencies between multi-task predictions. In addition, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture ($\text{C}^3\text{VG}$), which integrates implicit and explicit modeling approaches within a two-stage framework. First, query and pixel decoders generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are then refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of $\text{C}^3\text{VG}$, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
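To make the idea of a bidirectional box/mask consistency constraint more concrete, the following is a minimal sketch in plain Python. It is not the loss used in the paper, whose exact formulation is not given in the abstract. The function names (`mask_to_box`, `box_iou`, `consistency_penalties`) and both penalty definitions are assumptions chosen for illustration: one direction compares the predicted box with the tight box around the predicted mask, and the other measures how much of the mask falls outside the predicted box.

```python
# Illustrative sketch only: these penalties are assumptions, not the paper's loss.

def mask_to_box(mask):
    """Tight (x1, y1, x2, y2) box around nonzero pixels; x2 and y2 are exclusive."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None  # empty mask has no box
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def consistency_penalties(mask, box):
    """Return (mask_to_box, box_to_mask) penalties, each in [0, 1].

    mask_to_box: 1 - IoU between the predicted box and the mask's tight box.
    box_to_mask: fraction of mask pixels lying outside the predicted box.
    """
    derived = mask_to_box(mask)
    if derived is None:
        # Hypothetical convention: an empty mask fully disagrees with any box.
        return 1.0, 0.0
    total = sum(1 for row in mask for v in row if v)
    outside = sum(
        1
        for y, row in enumerate(mask)
        for x, v in enumerate(row)
        if v and not (box[0] <= x < box[2] and box[1] <= y < box[3])
    )
    return 1.0 - box_iou(derived, box), outside / total


# A 2x2 blob in the middle of a 4x4 image.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
aligned = consistency_penalties(mask, (1, 1, 3, 3))
shifted = consistency_penalties(mask, (0, 0, 2, 2))
```

Both penalties are zero when the box and mask agree and grow as the two predictions drift apart. A trainable version would need differentiable counterparts of these quantities, which this sketch does not attempt.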
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| referring-expression-segmentation-on-refcoco | C3VG | Overall IoU: 80.89 |
| referring-expression-segmentation-on-refcoco-3 | C3VG | Overall IoU: 74.68 |
| referring-expression-segmentation-on-refcoco-4 | C3VG | Overall IoU: 77.96 |
| referring-expression-segmentation-on-refcoco-5 | C3VG | Overall IoU: 68.95 |
| referring-expression-segmentation-on-refcoco-8 | C3VG | Overall IoU: 83.18 |
| referring-expression-segmentation-on-refcoco-9 | C3VG | Overall IoU: 77.86 |
| referring-expression-segmentation-on-refcocog | C3VG | Overall IoU: 74.43 |
| referring-expression-segmentation-on-refcocog-1 | C3VG | Overall IoU: 76.39 |
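The table reports Overall IoU without defining it. In the referring image segmentation literature this usually means cumulative intersection divided by cumulative union over the whole evaluation set, rather than the mean of per-sample IoU values. Assuming that convention applies here, a minimal sketch is shown below; the function names `overall_iou` and `mean_iou` are illustrative, and masks are plain nested lists of 0/1 values.

```python
def overall_iou(preds, targets):
    """Cumulative intersection over cumulative union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, targets):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                p, g = bool(p), bool(g)
                total_inter += int(p and g)
                total_union += int(p or g)
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds, targets):
    """Average of per-sample IoU, shown only for contrast."""
    scores = [overall_iou([p], [t]) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores) if scores else 0.0


preds = [
    [[1, 1], [0, 0]],  # small object, half of the prediction is wrong
    [[1, 1], [1, 1]],  # large object, predicted perfectly
]
targets = [
    [[1, 0], [0, 0]],
    [[1, 1], [1, 1]],
]
o_iou = overall_iou(preds, targets)  # (1 + 4) / (2 + 4)
m_iou = mean_iou(preds, targets)     # (0.5 + 1.0) / 2
```

Because intersections and unions are pooled before dividing, this metric weights large objects more heavily than a per-sample mean does. Multiplying the result by 100 gives values on the same scale as the table.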