
Video Mask Transfiner for High-Quality Video Instance Segmentation

Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu


Abstract

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, which leverages fine-grained high-resolution features through a highly efficient video transformer structure. VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach through iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as on the YouTube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the effectiveness of our method in segmenting complex and dynamic objects by capturing precise details.
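The abstract does not spell out how the sparse error-prone spatio-temporal regions are found. As a rough illustration only, the sketch below flags pixels of a tracklet whose mask value is lost when the mask is downsampled and restored, which tends to happen along thin structures and boundaries. This round-trip criterion, the function name `error_prone_regions` and the `factor` parameter are assumptions made for this example; they are not taken from the paper or from the SysCV/vmt code.

```python
import numpy as np


def error_prone_regions(masks: np.ndarray, factor: int = 2) -> np.ndarray:
    """Flag pixels whose value changes after a downsample/upsample round trip.

    masks: binary array of shape (T, H, W), one mask per frame of a tracklet.
    H and W must be divisible by `factor`.
    Returns a boolean array of the same shape (True = error-prone pixel).

    NOTE: hypothetical criterion used for illustration, not the VMT detector.
    """
    t, h, w = masks.shape
    if h % factor or w % factor:
        raise ValueError("H and W must be divisible by factor")

    # Split every frame into factor x factor blocks.
    blocks = masks.reshape(t, h // factor, factor, w // factor, factor)
    # Coarse mask: majority vote per block (ties count as foreground).
    coarse = blocks.astype(np.float32).mean(axis=(2, 4)) >= 0.5
    # Nearest-neighbour upsampling back to the original resolution.
    restored = np.repeat(np.repeat(coarse, factor, axis=1), factor, axis=2)
    # Pixels that disagree with the original mask lost detail in the round trip.
    return restored != masks.astype(bool)


if __name__ == "__main__":
    tracklet = np.zeros((2, 4, 4), dtype=np.uint8)
    tracklet[:, 0:3, 0:3] = 1  # a 3x3 object that does not align with 2x2 blocks
    flagged = error_prone_regions(tracklet)
    print(flagged[0].astype(int))
```

Because only the flagged pixels would be passed on for refinement, the amount of work scales with the length of the object boundary rather than with the full frame resolution, which is the general motivation for operating on sparse regions.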

Code Repositories

SysCV/vmt (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
video-instance-segmentation-on-hq-ytvis | VMT (R50) | Tube-Boundary AP: 30.7
video-instance-segmentation-on-hq-ytvis | VMT (R101) | Tube-Boundary AP: 32.5
video-instance-segmentation-on-hq-ytvis | VMT (Swin-L) | Tube-Boundary AP: 44.8
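The listing reports Tube-Boundary AP without defining it. One plausible reading, sketched below, is a boundary-restricted IoU accumulated over all frames of a predicted and a ground-truth tube, with AP then computed from such IoU values in the usual way. The band construction by 4-neighbour erosion, the `width` parameter and all function names are assumptions made for this example; the exact definition used by the benchmark may differ.

```python
import numpy as np


def erode(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary erosion with a 4-neighbourhood; pixels outside the image count as background."""
    out = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(out, 1, constant_values=False)
        out = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
               & p[1:-1, :-2] & p[1:-1, 2:])
    return out


def boundary_band(mask: np.ndarray, width: int = 1) -> np.ndarray:
    """Mask pixels that disappear after `width` erosion steps, i.e. the boundary band."""
    mask = mask.astype(bool)
    return mask & ~erode(mask, width)


def tube_boundary_iou(pred: np.ndarray, gt: np.ndarray, width: int = 1) -> float:
    """Boundary IoU accumulated over a tube of shape (T, H, W).

    NOTE: illustrative definition, assumed for this sketch.
    """
    inter, union = 0, 0
    for p, g in zip(pred, gt):
        bp, bg = boundary_band(p, width), boundary_band(g, width)
        inter += int(np.logical_and(bp, bg).sum())
        union += int(np.logical_or(bp, bg).sum())
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    gt = np.zeros((3, 8, 8), dtype=bool)
    gt[:, 2:6, 2:6] = True
    pred = np.zeros_like(gt)
    pred[:, 2:6, 3:7] = True  # same square shifted one pixel to the right
    print(tube_boundary_iou(pred, gt))
```

Restricting the comparison to a thin band makes the score far more sensitive to contour errors than a standard mask IoU, where the object interior dominates.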
