Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang


Abstract
Recent advances in text-only “slow-thinking” reasoning have prompted efforts to transfer this capability to vision-language models (VLMs) in order to train visual reasoning models (VRMs). However, such transfer faces a critical challenge: effective “slow thinking” in VRMs requires visual reflection, the ability to check the reasoning process against visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly as responses grow longer. To address this challenge, we propose Reflection-V, a new VRM that enhances visual reflection through reasoning data construction for cold-start training and reward design for reinforcement learning (RL). First, we construct vision-centered reasoning data by leveraging an agent that mediates between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Second, a visual-attention-based reward model is employed during RL to encourage reasoning grounded in visual information. As a result, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating that its visual reflection capability is effectively enhanced.
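
As a minimal illustrative sketch (not the paper's implementation): one way a visual-attention-based reward could be realized is as a scalar bonus proportional to the attention mass that generated tokens place on image tokens. The function name, tensor layout, and weighting factor alpha below are hypothetical assumptions for illustration only.

import torch

def visual_attention_reward(attentions: torch.Tensor,
                            image_token_mask: torch.Tensor,
                            alpha: float = 0.5) -> float:
    """Hypothetical reward bonus for visually grounded reasoning.

    attentions: [layers, heads, tgt_len, src_len] attention probabilities
        for one generated response (tgt_len generated tokens attending
        over src_len input positions).
    image_token_mask: bool tensor [src_len], True at image-token positions.
    alpha: weight of the visual-grounding bonus (an assumed hyperparameter).
    """
    # Attention probability each generated token assigns to image tokens.
    mass_on_image = attentions[..., image_token_mask].sum(dim=-1)  # [L, H, T]
    # Average over layers, heads, and generated tokens -> scalar bonus.
    return alpha * mass_on_image.mean().item()

# Example usage: combine with a task-correctness reward during RL.
# total_reward = correctness_reward + visual_attention_reward(attn, img_mask)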
Anthology ID:
2025.emnlp-main.470
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9262–9281
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.470/
Cite (ACL):
Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, and Jiajun Zhang. 2025. Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9262–9281, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models (Jian et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.470.pdf
Checklist:
2025.emnlp-main.470.checklist.pdf