Polar Relative Positional Encoding for Video-Language Segmentation
Qi Tian, Fei Wu, Lingxi Xie, Ke Ning

Abstract
In this paper, we tackle a challenging task named video-language segmentation. Given a video and a sentence in natural language, the goal is to segment the object or actor described by the sentence in the video frames. To denote a target object accurately, the given sentence usually refers to multiple attributes, such as nearby objects with spatial relations. We propose a novel Polar Relative Positional Encoding (PRPE) mechanism that represents spatial relations in a "linguistic" way, i.e., in terms of direction and range. Sentence features can interact with positional embeddings more directly to extract the implied relative positional relations. We also propose parameterized functions for these positional embeddings to adapt to real-valued directions and ranges. With PRPE, we design a Polar Attention Module (PAM) as the basic module for vision-language fusion. Our method outperforms the previous best method by a large margin of 11.4% absolute improvement in terms of mAP on the challenging A2D Sentences dataset, and also achieves competitive performance on the J-HMDB Sentences dataset.
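The core idea above is to describe the offset between two feature-map positions by its direction and range rather than by Cartesian offsets. As an illustration only, here is a minimal NumPy sketch of one way such a polar relative encoding could be computed; the function names (`polar_relative_positions`, `polar_embedding`) and the Fourier-feature-with-decay form are assumptions for this sketch, not the paper's actual PRPE or its parameterized functions:

```python
import numpy as np

def polar_relative_positions(h, w):
    """Pairwise polar offsets (direction, range) between all cells of an h x w grid."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    # Offset from position i to position j, for every pair: (N, N, 2).
    diff = coords[None, :, :] - coords[:, None, :]
    theta = np.arctan2(diff[..., 0], diff[..., 1])  # direction in [-pi, pi]
    rho = np.linalg.norm(diff, axis=-1)             # range (Euclidean distance)
    return theta, rho

def polar_embedding(theta, rho, dim=8, sigma=4.0):
    """Hypothetical parameterized embedding of polar offsets.

    Directions are encoded as Fourier features; the range modulates them
    through a smooth exponential decay (sigma would be a learnable scale).
    """
    k = np.arange(1, dim // 2 + 1)          # direction frequencies
    ang = theta[..., None] * k              # (N, N, dim // 2)
    decay = np.exp(-rho / sigma)[..., None] # range-dependent weight
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1) * decay
```

In an attention module such as the paper's PAM, embeddings like these would be combined with sentence features so that phrases such as "to the left of" can attend to the matching directions and ranges; the exact fusion used in the paper is not reproduced here.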
Benchmarks
| Benchmark | Method | AP | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | IoU (overall) | IoU (mean) |
|---|---|---|---|---|---|---|---|---|---|
| Referring Expression Segmentation on A2D Sentences | PRPE | 0.388 | 0.634 | 0.579 | 0.483 | 0.322 | 0.083 | 0.661 | 0.529 |
| Referring Expression Segmentation on J-HMDB Sentences | PRPE | 0.294 | 0.572 | 0.690 | 0.319 | 0.06 | 0.001 | — | — |