Dongling Li

2026

This paper focuses on answering complex visual questions that require cross-dimensional (e.g., 2D-to-3D) spatial reasoning. This task, called SpatialQA, can enhance a machine's spatial cognitive abilities along the chain of "plane representation - space reconstruction - semantic inference," and has great application value. Existing methods often recognize only 1-D visual objects and relations; they lack the ability to represent cross-dimensional space and fail to grasp structured geometric knowledge such as face-face topology and texture details. This causes problems such as texture misalignment and topological confusion, which lead to error accumulation and incorrect answers. To address this problem, we propose a new method with good cross-dimensional reasoning capabilities. Specifically, we first analyze the input image to capture its relations in the 2D plane. To derive the topological relations in 3D space, we employ a dual-channel augmentation technique that retrieves topologically isomorphic examples and geometric rules, supplementing missing but crucial reasoning clues. We then design a multi-perspective verifier that finds inconsistencies in the macroscopic outlines and eliminates incorrect options. Based on the visual clues, we develop a question-guided detector that finely analyzes the texture details and relations of each surface, capturing inconsistencies at the micro level. This corrects reasoning bias and helps derive the right answer. Moreover, we create a large-scale dataset with 22,483 samples for evaluation. The results show the effectiveness of our method.

Co-authors

Jianxing Yu

Venues

Findings
