Ho Kei Cheng; Seoung Wug Oh; Brian Price; Alexander Schwing; Joon-Young Lee

Abstract
Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
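
The decoupling described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the DEVA implementation: `segment_image` and `propagate` are hypothetical callables standing in for the task-specific image model and the task-agnostic propagation model, masks are plain sets of pixel coordinates, and fusion is a simple greedy IoU match in the forward direction only, swapped in for the paper's bi-directional fusion of hypotheses from several frames.

```python
from typing import Callable, Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]
Tracks = Dict[int, Mask]  # track id -> current mask


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse(propagated: Tracks, detected: List[Mask], next_id: int,
         thresh: float = 0.5) -> Tuple[Tracks, int]:
    """Greedy IoU merge (an assumption, not the paper's fusion rule)."""
    fused = dict(propagated)
    for det in detected:
        best_id, best = None, 0.0
        for track_id, mask in propagated.items():
            score = iou(mask, det)
            if score > best:
                best_id, best = track_id, score
        if best_id is not None and best >= thresh:
            fused[best_id] = det      # matched: refresh the existing track
        else:
            fused[next_id] = det      # unmatched: start a new track
            next_id += 1
    return fused, next_id


def track(frames: list,
          segment_image: Callable[[object], List[Mask]],
          propagate: Callable[[Tracks, object], Tracks],
          detect_every: int = 2) -> List[Tracks]:
    """Propagate on every frame; query the image model every few frames."""
    tracks: Tracks = {}
    next_id = 0
    outputs: List[Tracks] = []
    for t, frame in enumerate(frames):
        if tracks:
            tracks = propagate(tracks, frame)   # task-agnostic step
        if t % detect_every == 0:
            tracks, next_id = fuse(tracks, segment_image(frame), next_id)
        outputs.append(dict(tracks))
    return outputs
```

In this sketch only `segment_image` would change when moving to a new task; `propagate` and the fusion loop stay the same, which is the property the abstract attributes to the decoupled design.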
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| open-world-video-segmentation-on-burst-val | DEVA (Mask2Former) | OWTA (all): 69.9; OWTA (com): 75.2; OWTA (unc): 41.5 |
| open-world-video-segmentation-on-burst-val | DEVA (EntitySeg) | OWTA (all): 69.5; OWTA (com): 73.3; OWTA (unc): 50.5 |
| referring-expression-segmentation-on-davis | DEVA (ReferFormer) | J&F 1st frame: 66.3 |
| referring-expression-segmentation-on-refer-1 | DEVA (ReferFormer) | J&F: 66.0 |
| semi-supervised-video-object-segmentation-on-1 | DEVA | F-measure (Mean): 86.8; FPS: 25.3; J&F: 83.2; Jaccard (Mean): 79.6 |
| semi-supervised-video-object-segmentation-on-21 | DEVA (no OVIS) | F: 64.3; FPS: 25.3; J: 55.8; J&F: 60.0 |
| semi-supervised-video-object-segmentation-on-21 | DEVA (with OVIS) | F: 70.8; FPS: 25.3; J: 62.3; J&F: 66.5 |
| unsupervised-video-object-segmentation-on-10 | DEVA (DIS) | F: 90.2; G: 88.9; J: 87.6 |
| unsupervised-video-object-segmentation-on-4 | DEVA (EntitySeg) | F-measure (Mean): 76.4; J&F: 73.4; Jaccard (Mean): 70.4 |
| unsupervised-video-object-segmentation-on-5 | DEVA (EntitySeg) | J&F: 62.1 |
| video-panoptic-segmentation-on-vipseg | DEVA (Mask2Former - SwinB) | STQ: 52.2; VPQ: 55.0 |
| visual-object-tracking-on-davis-2017 | DEVA | F-measure (Mean): 91.0; J&F: 87.6; Jaccard (Mean): 84.2; Speed (FPS): 25.3 |
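
On DAVIS-style benchmarks, the combined J&F score is the arithmetic mean of the Jaccard index (J, region similarity) and the F-measure (F, contour accuracy). The short check below recomputes that mean for the table rows that report all three values. The row labels and the rounding tolerance are illustrative choices, not part of the source.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2


def is_consistent(j: float, f: float, reported: float,
                  tol: float = 0.06) -> bool:
    """True if the reported J&F matches the mean up to one-decimal rounding."""
    return abs(j_and_f(j, f) - reported) <= tol


# (J, F, reported J&F) copied from the benchmark table; labels are informal.
rows = {
    "semi-supervised-on-1, DEVA": (79.6, 86.8, 83.2),
    "semi-supervised-on-21, DEVA (no OVIS)": (55.8, 64.3, 60.0),
    "semi-supervised-on-21, DEVA (with OVIS)": (62.3, 70.8, 66.5),
    "unsupervised-on-4, DEVA (EntitySeg)": (70.4, 76.4, 73.4),
    "davis-2017 tracking, DEVA": (84.2, 91.0, 87.6),
}

for name, (j, f, reported) in rows.items():
    status = "ok" if is_consistent(j, f, reported) else "mismatch"
    print(f"{name}: mean={j_and_f(j, f):.2f} reported={reported} {status}")
```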