OpenScene: 3D Scene Understanding with Open Vocabularies
Songyou Peng, Kyle Genova, Chiyu Max Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser

Abstract
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
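The zero-shot classification and text-query steps described above can be sketched in a few lines. The sketch below assumes per-point features co-embedded in CLIP space have already been produced by an OpenScene-style model; the feature tensor, label prompts, and CLIP variant (`ViT-B/32`) are illustrative assumptions, not the authors' released code.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical input: dense per-point features in CLIP space,
# shape (num_points, feature_dim), predicted by the trained 3D model.
point_features = torch.randn(100_000, 512, device=device)
point_features = point_features / point_features.norm(dim=-1, keepdim=True)

# Arbitrary open-vocabulary class labels -- no fixed label set is required.
labels = ["chair", "table", "sofa", "wall", "floor"]
tokens = clip.tokenize([f"a {label} in a scene" for label in labels]).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens).float()
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between every 3D point and every label embedding;
# the per-point argmax yields a zero-shot semantic segmentation.
similarity = point_features @ text_features.T   # (num_points, num_labels)
segmentation = similarity.argmax(dim=-1)        # per-point label index

# A single free-form text query instead yields a per-point heat map,
# as in the open-vocabulary search application described above.
query = clip.tokenize(["somewhere to sit"]).to(device)
with torch.no_grad():
    query_feature = model.encode_text(query).float()
query_feature = query_feature / query_feature.norm(dim=-1, keepdim=True)
heatmap = (point_features @ query_feature.T).squeeze(-1)  # (num_points,)
```

Because both labels and queries are only encoded at inference time, swapping in a different vocabulary (materials, affordances, room types) requires no retraining of the 3D model.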
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| 3d-open-vocabulary-instance-segmentation-on-1 | OpenScene + Mask3D | mAP: 10.9 |