LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition
Jinyuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, Gang Pan

Abstract
Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types, and their corresponding visual regions. The GMNER task exhibits two challenging properties: 1) The weak correlation between image-text pairs in social media results in a significant portion of named entities being ungroundable. 2) There exists a distinction between the coarse-grained referring expressions commonly used in similar tasks (e.g., phrase localization, referring expression comprehension) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as a connecting bridge. This reformulation brings two benefits: 1) It maintains optimal MNER performance and eliminates the need to employ object detection methods to pre-extract regional features, thereby naturally addressing two major limitations of existing GMNER methods. 2) The introduction of entity expansion expressions and a Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This enables RiVEG to effortlessly inherit the Visual Entailment and Visual Grounding capabilities of any current or prospective multimodal pretraining model. Extensive experiments demonstrate that RiVEG outperforms state-of-the-art methods on the existing GMNER dataset and achieves absolute leads of 10.65%, 6.21%, and 8.83% in all three subtasks.
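The abstract describes a three-stage reformulation: MNER extracts entities, an LLM expands each fine-grained entity into a coarse-grained referring expression, a VE module filters ungroundable entities, and a VG module localizes the rest. The following is a minimal toy sketch of that control flow only; every function body here is a hypothetical stand-in (the real framework uses trained MNER, LLM, VE, and VG models), and all names are illustrative assumptions, not the authors' API.

```python
# Toy sketch of the RiVEG-style MNER -> LLM expansion -> VE -> VG pipeline.
# All components below are simplistic stand-ins for the real models.

def mner(text):
    # Stage 1 (MNER): extract (entity, type) pairs from the text.
    # Toy rule: treat capitalized tokens as PERSON entities.
    return [(tok, "PER") for tok in text.split() if tok[0].isupper()]

def expand_entity(entity, entity_type):
    # LLM bridge: rewrite a fine-grained named entity into the kind of
    # coarse-grained referring expression VE/VG models are pretrained on.
    noun = {"PER": "person", "LOC": "place", "ORG": "organization"}[entity_type]
    return f"the {noun} named {entity}"

def visual_entailment(image, expression):
    # Stage 2 (VE): decide whether the expression is groundable in the image,
    # handling the weakly correlated image-text pairs the abstract mentions.
    # Toy stand-in: consult a precomputed set of visible expressions.
    return expression in image["visible"]

def visual_grounding(image, expression):
    # Stage 3 (VG): return a bounding box for a groundable expression.
    return image["boxes"][expression]

def riveg(image, text):
    results = []
    for entity, etype in mner(text):
        expr = expand_entity(entity, etype)
        if visual_entailment(image, expr):
            results.append((entity, etype, visual_grounding(image, expr)))
        else:
            results.append((entity, etype, None))  # ungroundable entity
    return results
```

The VE step is what unifies VG and EG: ungroundable entities are filtered out before grounding instead of forcing the VG model to produce a spurious box.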
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| grounded-multimodal-named-entity-recognition | RiVEG | F1: 67.06 |