Polar Relative Positional Encoding for Video-Language Segmentation
Qi Tian, Fei Wu, Lingxi Xie, Ke Ning

Abstract
In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in the video frames. To denote a target object accurately, the given sentence usually refers to multiple attributes, such as nearby objects with spatial relations. We propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a "linguistic" way, i.e., in terms of direction and range. Sentence features can interact with positional embeddings more directly to extract the implied relative positional relations. We also propose parameterized functions for these positional embeddings to adapt to real-valued directions and ranges. With PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin of 11.4% absolute improvement in terms of mAP on the challenging A2D Sentences dataset, and also achieves competitive performance on the J-HMDB Sentences dataset.
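The core idea above is to describe the offset between two feature-map positions by its direction and range rather than by Cartesian offsets. As an illustration only, here is a minimal NumPy sketch of one way such a polar relative encoding could be computed; the function names (`polar_relative_positions`, `polar_embedding`) and the Fourier-feature-with-decay form are assumptions for this sketch, not the paper's actual PRPE or its parameterized functions:

```python
import numpy as np

def polar_relative_positions(h, w):
    """Pairwise polar offsets (direction, range) between all cells of an h x w grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    # Offset from position i to position j, for every pair: (N, N, 2).
    diff = coords[None, :, :] - coords[:, None, :]
    theta = np.arctan2(diff[..., 0], diff[..., 1])  # direction in [-pi, pi]
    rho = np.linalg.norm(diff, axis=-1)             # range (Euclidean distance)
    return theta, rho

def polar_embedding(theta, rho, dim=8, sigma=4.0):
    """Hypothetical parameterized embedding of polar offsets.

    Directions are encoded as Fourier features; the range modulates them
    through a smooth exponential decay (sigma would be a learnable scale).
    """
    k = np.arange(1, dim // 2 + 1)          # direction frequencies
    ang = theta[..., None] * k              # (N, N, dim // 2)
    decay = np.exp(-rho / sigma)[..., None] # range-dependent weight
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1) * decay
```

In an attention module such as the paper's PAM, embeddings like these would be combined with sentence features so that phrases such as "to the left of" can attend to the matching directions and ranges; the exact fusion used in the paper is not reproduced here.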
Benchmarks
| Benchmark | Method | AP | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | IoU (overall) | IoU (mean) |
|---|---|---|---|---|---|---|---|---|---|
| Referring Expression Segmentation on A2D Sentences | PRPE | 0.388 | 0.634 | 0.579 | 0.483 | 0.322 | 0.083 | 0.661 | 0.529 |
| Referring Expression Segmentation on J-HMDB Sentences | PRPE | 0.294 | 0.572 | 0.690 | 0.319 | 0.06 | 0.001 | — | — |