Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception

Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Tao Jin, Zhou Zhao


Abstract
3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations with the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue system for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. With limited data, Chat-3D achieves an 82.2% relative score compared with GPT-4 on the constructed instruction dataset, and comparable performance to state-of-the-art LLM-based methods.
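The core alignment idea in the abstract — mapping pre-trained 3D representations into the feature space of an LLM so the model can "perceive" scene objects as tokens — can be sketched as a learned projection layer. This is a minimal illustrative sketch, not the paper's implementation; the module name, dimensions, and single-linear-layer design are assumptions.

```python
import torch
import torch.nn as nn

class Scene3DProjector(nn.Module):
    """Hypothetical projector mapping frozen 3D object features into an
    LLM's hidden dimension, so projected objects can be fed to the LLM
    alongside text embeddings. Dimensions are illustrative."""

    def __init__(self, feat_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        # Trained during the alignment stages while the 3D encoder
        # and (optionally) the LLM stay frozen.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (num_objects, feat_dim) from a pre-trained 3D encoder
        return self.proj(object_feats)  # (num_objects, llm_dim)

# Toy usage: 8 objects in a scene, projected into the LLM embedding space.
feats = torch.randn(8, 512)
tokens = Scene3DProjector()(feats)
print(tokens.shape)  # torch.Size([8, 4096])
```

In such a scheme, the projected object tokens would typically be concatenated with the embedded text prompt before being passed to the LLM, which is how the object-centric prompt described in the abstract could reference individual scene objects.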
Anthology ID:
2025.findings-naacl.18
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Note:
Pages:
313–333
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.18/
Cite (ACL):
Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Tao Jin, and Zhou Zhao. 2025. Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 313–333, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception (Wang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.18.pdf