8 months ago

Abstract

The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, enabling LLMs to understand point clouds and offering a new avenue beyond 2D visual data. PointLLM understands colored object point clouds with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it leverages a point cloud encoder with a powerful LLM to effectively fuse geometric, appearance, and linguistic information. We collect a novel dataset comprising 660K simple and 70K complex point-text instruction pairs to enable a two-stage training strategy: aligning latent spaces and subsequently instruction-tuning the unified model. To rigorously evaluate the perceptual and generalization capabilities of PointLLM, we establish two benchmarks: Generative 3D Object Classification and 3D Object Captioning, assessed through three different methods, including human evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experimental results reveal PointLLM's superior performance over existing 2D and 3D baselines, with a notable achievement in human-evaluated object captioning tasks where it surpasses human annotators in over 50% of the samples. Codes, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM.
