
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

Bai, Sule; Li, Mingxing; Liu, Yong; Tang, Jing; Zhang, Haoji; Sun, Lei; Chu, Xiangxiang; Tang, Yansong
Release Date: 5/22/2025
Abstract

Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in conjunction with multiple images, poses significant challenges, mainly due to the lack of advanced reasoning ability across diverse multi-modal contexts. In this work, we aim to address the more practical universal grounding task and propose UniVG-R1, a reasoning-guided multimodal large language model (MLLM) for universal visual grounding, which enhances reasoning capabilities through reinforcement learning (RL) combined with cold-start data. Specifically, we first construct a high-quality Chain-of-Thought (CoT) grounding dataset, annotated with detailed reasoning chains, to guide the model toward correct reasoning paths via supervised fine-tuning. Subsequently, we perform rule-based reinforcement learning to encourage the model to identify correct reasoning chains, thereby incentivizing its reasoning capabilities. In addition, we identify a difficulty bias arising from the prevalence of easy samples as RL training progresses, and we propose a difficulty-aware weight adjustment strategy to further strengthen performance. Experimental results demonstrate the effectiveness of UniVG-R1, which achieves state-of-the-art performance on MIG-Bench with a 9.1% improvement over the previous method. Furthermore, our model exhibits strong generalizability, achieving an average improvement of 23.4% in zero-shot performance across four image and video reasoning grounding benchmarks. The project page can be accessed at https://amap-ml.github.io/UniVG-R1-page/.
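
The abstract does not spell out the reward design or the weighting rule, but a common reading of "rule-based reinforcement learning" for grounding combines a format check with a box-accuracy (IoU) term, and the difficulty-aware adjustment can be read as scaling each sample's contribution by how hard its rollout group found it. The Python sketch below is a minimal illustration under those assumptions; the function names (rule_based_reward, difficulty_aware_weights), the GRPO-style group-normalized advantages, and the specific weighting form are hypothetical, not the authors' exact formulation.

    import numpy as np

    def iou(box_a, box_b):
        """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def rule_based_reward(pred_box, gt_box, well_formatted):
        """Hypothetical rule-based reward: a binary format term plus an IoU accuracy term."""
        return (1.0 if well_formatted else 0.0) + iou(pred_box, gt_box)

    def grpo_advantages(rewards):
        """GRPO-style group-normalized advantages (an assumption; the abstract
        only says 'rule-based reinforcement learning')."""
        r = np.asarray(rewards, dtype=np.float64)
        return (r - r.mean()) / (r.std() + 1e-8)

    def difficulty_aware_weight(rewards, max_reward=2.0):
        """Hypothetical difficulty weight: a rollout group that already scores
        near the reward ceiling is an easy sample and gets down-weighted, so
        hard samples dominate the update as training progresses."""
        return 1.0 - float(np.mean(rewards)) / max_reward  # 0 = trivially easy

    # Toy usage: one prompt, four sampled rollouts (pred_box, is_well_formatted).
    gt = [10, 10, 60, 80]
    rollouts = [([12, 9, 58, 82], True), ([0, 0, 30, 30], True),
                ([11, 12, 61, 79], True), ([10, 10, 60, 80], False)]
    rewards = [rule_based_reward(p, gt, ok) for p, ok in rollouts]
    adv = grpo_advantages(rewards) * difficulty_aware_weight(rewards)
    print(rewards, adv)

Down-weighting groups whose rollouts are already mostly correct is one plausible way to counteract the difficulty bias the abstract describes, since easy samples would otherwise dominate the gradient signal late in RL training.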