BEVBert: Multimodal Map Pre-training for Language-guided Navigation

Dong An Yuankai Qi Yangguang Li Yan Huang Liang Wang Tieniu Tan Jing Shao

Abstract

Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations, which requires the model to implicitly correlate incomplete, duplicate observations within the panoramas and may impair an agent's spatial understanding. We therefore propose a spatial-aware, map-based pre-training paradigm for VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design balances VLN's demands for both short-term reasoning and long-term planning. Based on the hybrid map, we then devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning and thereby facilitates the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
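To make the hybrid-map idea concrete, below is a minimal sketch, not the authors' implementation: a local metric (bird's-eye-view) grid that explicitly aggregates per-view features by their projected ground-plane position, alongside a global topological graph of visited viewpoints. All names, sizes, and the per-cell mean-pooling rule are illustrative assumptions.

```python
import torch

class HybridMap:
    def __init__(self, grid_size=21, cell_m=0.5, feat_dim=768):
        self.grid_size, self.cell_m, self.feat_dim = grid_size, cell_m, feat_dim
        self.nodes = {}     # node_id -> feature (global topological map)
        self.edges = set()  # (node_id, node_id) navigability links

    def local_metric_map(self, feats, xy):
        """Scatter view features into an egocentric BEV grid.

        feats: (N, feat_dim) features of N visual observations
        xy:    (N, 2) egocentric ground-plane positions in meters
        Duplicate observations landing in the same cell are averaged,
        removing redundancy across overlapping views.
        """
        G = self.grid_size
        grid = torch.zeros(G, G, self.feat_dim)
        count = torch.zeros(G, G, 1)
        # Convert metric positions to grid indices centered on the agent.
        idx = (xy / self.cell_m + G // 2).long().clamp(0, G - 1)
        for f, (i, j) in zip(feats, idx):
            grid[i, j] += f
            count[i, j] += 1
        return grid / count.clamp(min=1)  # mean-pool per cell

    def add_node(self, node_id, feat, neighbors=()):
        """Grow the global topological map with a visited viewpoint."""
        self.nodes[node_id] = feat
        for n in neighbors:
            self.edges.add((node_id, n))

# Usage: aggregate one panorama's 12 view features, then register the node.
m = HybridMap()
views = torch.randn(12, 768)
positions = torch.rand(12, 2) * 4 - 2      # egocentric xy in meters
bev = m.local_metric_map(views, positions)  # (21, 21, 768) local metric map
m.add_node("vp_0", views.mean(0))
```

In this framing, short-term reasoning reads from the dense local grid, while long-term planning operates over the sparse node graph; the paper's pre-training objectives then learn a joint representation over both.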

Code Repositories

marsaki/vln-bevbert (official implementation, PyTorch)

Benchmarks

Benchmark: Visual Navigation on Room-to-Room (R2R)
Methodology: BEVBert
Metrics: SPL: 0.60

