Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that better performance in matching text prompts to 3D masks can be achieved faster with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to ~16x speedup compared to the best existing method in the literature. On the ScanNet200 val. set, our Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.
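
The labeling idea described in the abstract (project class-agnostic 3D instance proposals into the multi-view RGB frames, then assign labels by scoring each proposal against open-vocabulary 2D detections instead of running SAM and CLIP) can be illustrated with a short, self-contained sketch. This is not the authors' implementation: all function names, data layouts, and the score-accumulation rule below are illustrative assumptions.

# Hypothetical sketch of multi-view label transfer from 2D detections to
# class-agnostic 3D instance proposals. Not the official Open-YOLO 3D code.
import numpy as np

def project_points(points_xyz, intrinsics, world_to_cam, image_hw):
    """Project Nx3 world points into one view; return pixel coords and a visibility mask."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])           # N x 4 homogeneous coords
    cam = (world_to_cam @ homo.T).T                           # N x 4, camera frame
    valid = cam[:, 2] > 1e-6                                  # keep points in front of the camera
    uvw = (intrinsics @ cam[:, :3].T).T                       # N x 3 image-plane coords
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-6)           # perspective divide
    h, w = image_hw
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, valid & inside

def label_3d_instances(instance_point_ids, points_xyz, views, detections, num_classes):
    """
    instance_point_ids : list of index arrays, one per class-agnostic 3D proposal
    views              : list of dicts with 'intrinsics', 'world_to_cam', 'image_hw'
    detections         : per view, list of (class_id, score, x1, y1, x2, y2) 2D boxes
    Returns one class id per 3D proposal by accumulating detector scores over views.
    """
    votes = np.zeros((len(instance_point_ids), num_classes))
    for view, boxes in zip(views, detections):
        uv, vis = project_points(points_xyz, view["intrinsics"],
                                 view["world_to_cam"], view["image_hw"])
        for i, ids in enumerate(instance_point_ids):
            pix = uv[ids][vis[ids]]                           # visible projected points of proposal i
            if len(pix) == 0:
                continue
            for cls, score, x1, y1, x2, y2 in boxes:
                inside = ((pix[:, 0] >= x1) & (pix[:, 0] <= x2) &
                          (pix[:, 1] >= y1) & (pix[:, 1] <= y2))
                # Weight the vote by how much of the projected proposal the box covers.
                votes[i, cls] += score * inside.mean()
    return votes.argmax(axis=1)

Accumulating detector confidences over all views makes the final label robust to occlusion in any single frame, which is the role multi-view aggregation plays in the abstract; the key difference from prior pipelines is that no SAM masks or per-mask CLIP features are needed for this step.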

Code Repositories

aminebdj/openyolo3d (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
3d-open-vocabulary-instance-segmentation-on | Open-YOLO 3D | mAP: 24.7, AP50: 31.7, AP25: 36.2, AP Head: 27.8, AP Common: 24.3, AP Tail: 21.6
3d-open-vocabulary-instance-segmentation-on-1 | Open-YOLO 3D | mAP: 23.7
