Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Abstract
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance on vision-language tasks, particularly visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks such as distinguishing a left from a right location. In this work, we explore how image-space, coordinate-based instruction fine-tuning objectives can inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments on 5 vision-language tasks spanning 14 datasets establish the clear performance improvements achieved by our proposed framework.
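As a rough illustration of what a coordinate-based instruction fine-tuning sample could look like, the sketch below normalizes a pixel-space bounding box to [0, 1] and renders it as text inside a question-answer pair. The abstract does not specify the exact format used by LocVLM; the normalization scheme, prompt template, and function names here are assumptions for illustration only.

```python
# Hedged sketch (not the paper's exact format): turning a bounding box into a
# textual coordinate representation for instruction fine-tuning data.

def box_to_text(box, image_w, image_h, precision=2):
    """Normalize an (x1, y1, x2, y2) pixel box to [0, 1] and render it as text."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

def make_localization_sample(object_name, box, image_w, image_h):
    """Build one (question, answer) pair that asks the model to localize an object."""
    question = f"Where is the {object_name} located in the image?"
    answer = f"The {object_name} is at {box_to_text(box, image_w, image_h)}."
    return {"question": question, "answer": answer}

if __name__ == "__main__":
    # A dog occupying the left half of a 640x480 image: x-coordinates stay below 0.5,
    # which is the kind of signal that lets a model distinguish "left" from "right".
    sample = make_localization_sample("dog", (20, 120, 300, 460), 640, 480)
    print(sample["question"])
    print(sample["answer"])
```

Training on pairs like this (alongside pseudo-generated boxes from an off-the-shelf detector, as the abstract's pseudo-data generation strategy suggests) is one plausible way such objectives tie language outputs to image-space locations.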
Benchmarks
| Benchmark | Model | Accuracy (%) |
|---|---|---|
| Video Question Answering on ActivityNet-QA | LocVLM-Vid-B+ | 38.2 |
| Video Question Answering on ActivityNet-QA | LocVLM-Vid-B | 37.4 |
| Video Question Answering on MSR-VTT | LocVLM-Vid-B | 51.2 |
| Video Question Answering on MSVD-QA | LocVLM-Vid-B | 66.1 |
| Video Question Answering on TGIF-QA | LocVLM-Vid-B | 51.8 |
| Visual Question Answering on GQA | LocVLM-L | 50.2 |
| Visual Question Answering on VQA v2 (test-dev) | LocVLM-L | 56.2 |
| Visual Question Answering on VQA v2 (val) | LocVLM-L | 55.9 |