
Abstract

We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.
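The per-mask aggregation and open-vocabulary querying described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each mask already has one image embedding per view, that fusion is a plain average of unit-normalized embeddings, and that a query is scored by cosine similarity. The function names (`fuse_views`, `rank_masks`) and the toy 3-dimensional vectors are hypothetical; a real pipeline would obtain the embeddings from a CLIP image and text encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_views(view_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Fuse per-view embeddings of one mask into a single feature.

    Assumption: fusion is the mean of unit-normalized view embeddings,
    re-normalized afterwards. The paper's exact fusion may differ.
    """
    if not view_embeddings:
        raise ValueError("a mask needs at least one view embedding")
    normalized = [l2_normalize(v) for v in view_embeddings]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def rank_masks(
    mask_features: Dict[str, Vector], query: Sequence[float]
) -> List[Tuple[str, float]]:
    """Rank masks by cosine similarity between mask feature and query embedding."""
    q = l2_normalize(query)
    scores = {
        mask_id: sum(a * b for a, b in zip(feature, q))
        for mask_id, feature in mask_features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for image embeddings of two instance masks seen from two views each.
views_per_mask = {
    "mask_0": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "mask_1": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
mask_features = {mask_id: fuse_views(v) for mask_id, v in views_per_mask.items()}

# Toy stand-in for the text embedding of a free-form query.
ranking = rank_masks(mask_features, query=[1.0, 0.0, 0.0])
```

Because each instance mask carries its own fused feature, a query returns whole object instances rather than a per-point semantic labeling, which is the distinction the abstract draws from prior per-point methods.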
