E-ViC: Reasoning Beyond Text via Embodied Visual Chain for Spatial Intelligence
Junbo Qi, Yi Zhang, Hanchu Ni, Che Liu, Zhimin Yao, Ruilin Yang, Xiancong Ren, Liangjian Wen, Wei Ge, Yuya Ieiri, Osamu Yoshie, Yong Dai, Xiaozhu Ju
Abstract
Precise spatial reasoning is fundamental to embodied intelligence, yet current Vision-Language Models (VLMs) remain bottlenecked by text-based Chain-of-Thought (CoT) that relies solely on textual reasoning trajectories, often bypassing active engagement with fine-grained visual details. To address this, we present E-ViC (Embodied Visual Chain), a framework that moves reasoning beyond text and directly into the visual domain. By formulating visual operations (e.g., zooming, marking) as executable primitives, E-ViC transforms perception from static prediction into an active verification process. Distinct from approaches relying on supervised step-wise trajectories, E-ViC is trained via an agentic reinforcement learning paradigm. This enables the model to autonomously discover optimal policies, leading to the emergence of human-like “look-and-confirm” strategies driven solely by task-level rewards. To facilitate this, we curate a comprehensive 24.4K-sample dataset covering diverse embodied tasks. By grounding reasoning in pixel-level interactions, E-ViC reframes spatial intelligence as a verifiable, tool-using capability. Extensive evaluations on external benchmarks demonstrate that our approach consistently outperforms strong VLM baselines with an average gain of 10.1%.
- Anthology ID:
- 2026.acl-long.1870
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 40283–40303
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1870/
- Cite (ACL):
- Junbo Qi, Yi Zhang, Hanchu Ni, Che Liu, Zhimin Yao, Ruilin Yang, Xiancong Ren, Liangjian Wen, Wei Ge, Yuya Ieiri, Osamu Yoshie, Yong Dai, and Xiaozhu Ju. 2026. E-ViC: Reasoning Beyond Text via Embodied Visual Chain for Spatial Intelligence. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 40283–40303, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- E-ViC: Reasoning Beyond Text via Embodied Visual Chain for Spatial Intelligence (Qi et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1870.pdf