SceneLM: 3D-Aware Language Models for Editable 3D Scene Synthesis

Xingbo Yao, Xiaoyu Chen, Doudou Zhang, Mingzhi Sheng, Boyuan Cao, Ying-Cong Chen, Hui Xiong


Abstract
Synthesizing an editable 3D scene from a single RGB image is central to content creation, embodied-agent data generation, and AR/VR, yet achieving both high-fidelity reconstruction and convenient interactive editing remains challenging. Existing geometry-based pipelines produce high-quality 3D results but are typically hard to refine without rerunning the full process, while LLM-driven procedural systems enable interactive tool use but are mostly text-driven and lack precise metric 3D understanding from images. We present SceneLM, a language-model-based framework that grounds 3D scene synthesis in visual evidence by recovering an executable metric 3D layout directly from a single image. Given an RGB image (and camera intrinsics when available), SceneLM outputs a JSON-form layout specifying each object’s category, 3D center, size, and discretized yaw, and then deterministically executes this layout with a tool suite to instantiate, place, and edit objects for iterative refinement. To train metric layout recovery at scale, we curate five datasets covering diverse indoor, outdoor, and tabletop scenes and convert heterogeneous 3D annotations into a unified instruction-tuning format. To improve numerical stability and metric accuracy while preserving the text interface, we augment autoregressive JSON generation with a lightweight geometry prediction branch and dual supervision. Experiments show that SceneLM substantially improves single-image 3D layout estimation over strong open and proprietary MLLM baselines, and yields higher-quality end-to-end scene generation in terms of geometric consistency, physical plausibility, semantic alignment, and realism.
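As an illustration of what a JSON-form layout with category, 3D center, size, and discretized yaw might look like to a downstream tool, the following is a minimal sketch and not the paper's actual schema. The field names (`objects`, `category`, `center`, `size`, `yaw_bin`), the 36-bin yaw discretization, the metre units, and the z-up axis convention are all assumptions made for this example; the abstract does not specify them.

```python
import json
import math
from dataclasses import dataclass

# Assumption: yaw is discretized into 36 bins of 10 degrees each.
# The abstract does not state the actual bin count.
NUM_YAW_BINS = 36


@dataclass
class LayoutObject:
    """One object of a hypothetical layout (field names are illustrative)."""
    category: str
    center: tuple   # (x, y, z), assumed metres, z up
    size: tuple     # (width, depth, height), assumed metres
    yaw_bin: int    # discrete yaw index in [0, NUM_YAW_BINS)

    @property
    def yaw(self) -> float:
        """Yaw in radians recovered from the discrete bin index."""
        return 2.0 * math.pi * (self.yaw_bin % NUM_YAW_BINS) / NUM_YAW_BINS

    def footprint(self):
        """Four (x, y) corners of the rotated box footprint on the ground plane."""
        cx, cy, _ = self.center
        w, d, _ = self.size
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        offsets = ((w / 2, d / 2), (-w / 2, d / 2), (-w / 2, -d / 2), (w / 2, -d / 2))
        # Rotate each local offset by yaw, then translate to the object center.
        return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in offsets]


def parse_layout(text: str):
    """Parse a JSON string into LayoutObject instances."""
    data = json.loads(text)
    return [
        LayoutObject(
            category=obj["category"],
            center=tuple(obj["center"]),
            size=tuple(obj["size"]),
            yaw_bin=int(obj["yaw_bin"]),
        )
        for obj in data["objects"]
    ]


# Made-up example layout, not taken from the paper.
example = (
    '{"objects": [{"category": "chair", "center": [1.0, 2.0, 0.45], '
    '"size": [0.5, 0.6, 0.9], "yaw_bin": 9}]}'
)
layout = parse_layout(example)
```

Because the layout is plain data, an edit such as moving or rotating an object reduces to changing a few numbers and re-executing, which is one way to read the abstract's point about refinement without rerunning the full pipeline.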
Anthology ID:
2026.findings-acl.2116
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
42615–42624
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2116/
Cite (ACL):
Xingbo Yao, Xiaoyu Chen, Doudou Zhang, Mingzhi Sheng, Boyuan Cao, Ying-Cong Chen, and Hui Xiong. 2026. SceneLM: 3D-Aware Language Models for Editable 3D Scene Synthesis. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42615–42624, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SceneLM: 3D-Aware Language Models for Editable 3D Scene Synthesis (Yao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2116.pdf
Checklist:
 2026.findings-acl.2116.checklist.pdf