@inproceedings{dong-etal-2025-bridging,
title = "Bridging Language and Scenes through Explicit 3-{D} Model Construction",
author = "Dong, Tiansi and
Das, Writwick and
Sifa, Rafet",
editor = "Liu, Kang and
Song, Yangqiu and
Han, Zhen and
Sifa, Rafet and
He, Shizhu and
Long, Yunfei",
booktitle = "Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "ELRA and ICCL",
url = "https://preview.aclanthology.org/fix-sig-urls/2025.neusymbridge-1.6/",
pages = "51--60",
abstract = "We introduce the methodology of explicit model construction to bridge linguistic descriptions and scene perception and demonstrate that in Visual Question-Answering (VQA) using MC4VQA (Model Construction for Visual Question-Answering), a method developed by us. Given a question about a scene, our MC4VQA first recognizes objects utilizing pre-trained deep learning systems. Then, it constructs an explicit 3-D layout by repeatedly reducing the difference between the input scene image and the image rendered from the current 3-D spatial environment. This novel ``iterative rendering'' process endows MC4VQA the capability of acquiring spatial attributes without training data. MC4VQA outperforms NS-VQA (the SOTA system) by reaching 99.94{\%} accuracy on the benchmark CLEVR datasets, and is more robust than NS-VQA on new testing datasets. With newly created testing data, NS-VQA{'}s performance dropped to 97.60{\%}, while MC4VQA still kept the 99.0{\%} accuracy. This work sets a new SOTA performance of VQA on the benchmark CLEVR datasets, and shapes a new method that may solve the out-of-distribution problem."
}
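The abstract's ``iterative rendering'' step (propose a 3-D layout, render it, compare the rendering with the observed image, adjust the layout to shrink the difference) can be illustrated with a minimal sketch. The toy 2-D renderer, the hill-climbing search, and all names below (render, pixel_loss, fit_layout) are illustrative assumptions, not the authors' implementation; per the abstract, the initial layout would in practice come from objects recognized by pre-trained deep learning systems, so the starting guess is assumed to be roughly correct.

```python
# Hedged sketch of an iterative-rendering (render-and-compare) loop.
# Everything here is a toy stand-in for the paper's 3-D pipeline.
import numpy as np

H, W = 32, 32

def render(positions):
    """Render a toy image: one bright 5x5 square per object position."""
    img = np.zeros((H, W))
    for r, c in positions:
        r, c = int(round(r)), int(round(c))
        img[max(r - 2, 0):r + 3, max(c - 2, 0):c + 3] = 1.0
    return img

def pixel_loss(rendered, observed):
    """Difference between the rendered image and the input scene image."""
    return float(np.abs(rendered - observed).sum())

def fit_layout(observed, positions, iters=200, step=1.0):
    """Greedily nudge object positions to reduce the render-vs-observation gap."""
    best = pixel_loss(render(positions), observed)
    for _ in range(iters):
        improved = False
        for i in range(len(positions)):
            for dr, dc in [(-step, 0), (step, 0), (0, -step), (0, step)]:
                cand = [p[:] for p in positions]
                cand[i][0] += dr
                cand[i][1] += dc
                loss = pixel_loss(render(cand), observed)
                if loss < best:
                    best, positions, improved = loss, cand, True
        if not improved:
            break
    return positions, best

if __name__ == "__main__":
    observed = render([[8.0, 8.0], [20.0, 24.0]])      # the "input scene"
    guess = [[10.0, 10.0], [18.0, 22.0]]                # rough initial layout
    fitted, loss = fit_layout(observed, guess)
    print("recovered layout:", fitted, "final loss:", loss)
```

In this sketch the layout is refined by local search over pixel differences; the actual system would optimize richer 3-D spatial attributes against a proper renderer, but the loop structure (render, compare, update) is the same idea the abstract describes.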