
Abstract

We present the Modular interactive VOS (MiVOS) framework, which decouples interaction-to-mask and mask propagation, allowing for higher generalizability and better performance. Trained separately, the interaction module converts user interactions to an object mask, which is then temporally propagated by our propagation module using a novel top-$k$ filtering strategy in reading the space-time memory. To effectively take the user's intent into account, a novel difference-aware module is proposed to learn how to properly fuse the masks before and after each interaction, which are aligned with the target frames by employing the space-time memory. We evaluate our method both qualitatively and quantitatively with different forms of user interactions (e.g., scribbles, clicks) on DAVIS to show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions, with the additional advantage of generalizing to different types of user interactions. We contribute a large-scale synthetic VOS dataset with pixel-accurate segmentation of 4.8M frames to accompany our source code to facilitate future research.
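The abstract does not spell out how the top-$k$ filtering is computed, so the following is only an illustrative sketch of the general idea: when a query reads from memory, only the k best-matching memory entries contribute, and the attention weights are renormalized over those entries. The dot-product affinity, the softmax over the kept entries, and the name `top_k_memory_read` are assumptions made for this example and are not taken from the paper or its released code.

```python
import math


def top_k_memory_read(query, memory_keys, memory_values, k):
    """Read from memory for one query vector, keeping only the top-k matches.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if not 1 <= k <= len(memory_keys):
        raise ValueError("k must be between 1 and the number of memory entries")

    # Assumed similarity: dot product between the query and each memory key.
    scores = [sum(q * m for q, m in zip(query, key)) for key in memory_keys]

    # Indices of the k highest-scoring memory entries.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax restricted to the kept entries (max subtracted for stability).
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())

    # Entries outside the top-k receive exactly zero weight.
    weights = [exps.get(i, 0.0) / total for i in range(len(scores))]

    # Weighted sum of the memory values over the kept entries.
    dim = len(memory_values[0])
    readout = [sum(weights[i] * memory_values[i][d] for i in top) for d in range(dim)]
    return weights, readout


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
    values = [[1.0], [10.0], [3.0], [100.0]]
    weights, readout = top_k_memory_read([1.0, 0.0], keys, values, k=2)
    print(weights, readout)
```

In this toy call, only the two keys most similar to the query influence the readout; the poorly matching entries, including the one with the large value of 100, are filtered out instead of adding noise to the result.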
