Hierarchical interaction network for video object segmentation from referring expressions

Philip Torr, Hengshuang Zhao, Luca Bertinetto, Yansong Tang, Zhao Yang

Abstract

In this paper, we investigate the problem of video object segmentation from referring expressions (VOSRE). Conventional methods typically perform multi-modal fusion based on linguistic features and the visual features extracted from the top layer of the visual encoder, which limits these models' ability to represent multi-modal inputs at different semantic and spatial granularity levels. To address this issue, we present an end-to-end hierarchical interaction network (HINet) for the VOSRE problem. Our model leverages the feature pyramid produced by the visual encoder to generate multiple levels of multi-modal features. This allows more flexible representation of various linguistic concepts (e.g., object attributes and categories) in different levels of the multi-modal features. Moreover, we further extract signals of moving objects from optical flow input, and utilize them as complementary cues for highlighting the referent and suppressing the background with a motion gating mechanism. In contrast to previous methods, this strategy allows our model to make online predictions without requiring the whole video as input. Despite its simplicity, our proposed HINet improves over the previous state of the art on the DAVIS-16, DAVIS-17, and J-HMDB datasets for the VOSRE task, demonstrating its effectiveness and generality.
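The abstract does not spell out how the motion gating mechanism is computed. As a rough illustration only, the sketch below uses one common form of gating: a sigmoid turns a motion response into a weight in (0, 1), and that weight scales the multi-modal response at the same location, independently at each pyramid level. The function names, the single-channel maps, and the sigmoid choice are all assumptions made for this example, not the authors' formulation.

```python
import math


def sigmoid(x):
    """Map a real-valued motion response to a gate weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_level(fused, motion):
    """Scale each multi-modal response by a gate derived from motion.

    `fused` and `motion` are 2D lists of equal shape (hypothetical
    single-channel maps; real features would have many channels).
    """
    return [
        [f * sigmoid(m) for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(fused, motion)
    ]


def gate_pyramid(fused_levels, motion_levels):
    """Apply the gate independently at every level of the pyramid."""
    return [gate_level(f, m) for f, m in zip(fused_levels, motion_levels)]


# Two hypothetical pyramid levels: a 2x2 fine map and a 1x1 coarse map.
fused_levels = [[[1.0, 1.0], [1.0, 1.0]], [[2.0]]]
# Large positive values stand in for "moving", large negative for "static".
motion_levels = [[[4.0, -4.0], [0.0, -4.0]], [[0.0]]]

gated = gate_pyramid(fused_levels, motion_levels)
```

In this toy input, locations with a strong motion response keep almost all of their multi-modal response, while static locations are pushed toward zero. Because the gate needs only the flow for the current frame, a scheme of this kind is compatible with the online prediction setting the abstract describes.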

Benchmarks

| Benchmark | Method | IoU mean | IoU overall | Precision@0.5 | Precision@0.6 | Precision@0.7 | Precision@0.8 | Precision@0.9 |
|---|---|---|---|---|---|---|---|---|
| referring-expression-segmentation-on-a2d | RefVOS | 0.497 | 0.672 | 0.578 | 0.534 | 0.456 | 0.311 | 0.093 |
| referring-expression-segmentation-on-a2d | HINet | 0.529 | 0.679 | 0.611 | 0.559 | 0.486 | 0.342 | 0.12 |
| referring-expression-segmentation-on-j-hmdb | RefVOS | 0.568 | 0.606 | 0.731 | 0.62 | 0.392 | 0.088 | 0.0 |
| referring-expression-segmentation-on-j-hmdb | HINet | 0.627 | 0.652 | 0.819 | 0.736 | 0.542 | 0.168 | 0.004 |

| Benchmark | Method | J&F 1st frame | J&F Full video |
|---|---|---|---|
| referring-expression-segmentation-on-davis | HINet | 50.2 | 47.9 |
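The page does not define the A2D and J-HMDB metrics. The sketch below assumes the definitions commonly used for these referring segmentation benchmarks: IoU mean averages the per-sample IoU, IoU overall divides the total intersection by the total union across all samples, and Precision@K is the fraction of samples whose IoU exceeds K. The function names and the flat 0/1 mask encoding are hypothetical, chosen to keep the example self-contained.

```python
def intersection_and_union(pred, gt):
    """Count intersecting and united pixels of two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute IoU mean, IoU overall, and Precision@K over a set of samples."""
    ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: two empty masks count as a perfect match.
        ious.append(inter / union if union else 1.0)
    n = len(ious)
    return {
        "iou_mean": sum(ious) / n,
        "iou_overall": total_inter / total_union if total_union else 1.0,
        "precision": {t: sum(iou > t for iou in ious) / n for t in thresholds},
    }


# Two toy samples with per-sample IoU of 1/2 and 3/4.
preds = [[1, 1, 0, 0], [1, 1, 1, 1]]
gts = [[1, 0, 0, 0], [1, 1, 1, 0]]
result = evaluate(preds, gts)
```

Under these definitions IoU overall weights large objects more heavily than IoU mean does, which is why the two values can differ for the same method. Precision@K can only stay the same or fall as K increases, since any sample that passes a stricter threshold also passes every looser one.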