Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding

Beomsik Cho, Jaehyung Kim


Abstract
Large Vision–Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model’s decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that guides text generation in LVLMs by Referencing Vision Tokens. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization. Then, ReVisiT uses its constrained projection to refine the output distribution to better incorporate visual semantics. Across five benchmarks on recent LVLMs, ReVisiT consistently enhances visual grounding with minimal computational overhead, and achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to .
Anthology ID:
2026.acl-long.262
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
5794–5824
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.262/
DOI:
Bibkey:
Cite (ACL):
Beomsik Cho and Jaehyung Kim. 2026. Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5794–5824, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding (Cho & Kim, ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.262.pdf
Checklist:
 2026.acl-long.262.checklist.pdf