Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang


Abstract
Typical large vision-language models (LVLMs) apply autoregressive supervision primarily to textual responses, without fully exploiting causal learning over rich visual inputs. As a result, these models often emphasize vision-to-language alignment while potentially overlooking fine-grained visual information. Although prior work has explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. ASVR trains models to autoregressively reconstruct the semantic content of input images, which consistently enhances multimodal comprehension. Notably, we show that even when given continuous image features as input, models can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across various multimodal understanding benchmarks. ASVR delivers significant and scalable performance gains across varying data scales, visual inputs, visual supervision signals, and model architectures. In particular, ASVR generally improves baselines by 2-3% across 14 multimodal benchmarks.
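The joint objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the training loss is a weighted sum of two next-token cross-entropy terms, one over discrete semantic visual token ids and one over text token ids. The function names, the `visual_weight` parameter, and the assumption that targets are already shifted by one position are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_nll(logits_per_position, target_ids):
    """Mean negative log-likelihood; logits_per_position[i] is scored
    against target_ids[i]. Targets are assumed to be pre-shifted by the
    caller so that each position predicts the following token."""
    assert len(logits_per_position) == len(target_ids)
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def joint_autoregressive_loss(visual_logits, visual_token_ids,
                              text_logits, text_token_ids,
                              visual_weight=1.0):
    """Hypothetical joint loss: text NLL plus weighted visual NLL.

    visual_logits: outputs at the image positions (whose inputs may be
        continuous features), scored against discrete semantic token ids.
    text_logits: outputs at the text positions, scored against text ids.
    The additive form and visual_weight are assumptions, not taken
    from the paper.
    """
    visual_loss = next_token_nll(visual_logits, visual_token_ids)
    text_loss = next_token_nll(text_logits, text_token_ids)
    return text_loss + visual_weight * visual_loss, visual_loss, text_loss
```

With uniform logits, each term reduces to the log of its vocabulary size, which gives a quick sanity check on the two loss components before plugging in real model outputs.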
Anthology ID:
2026.findings-acl.1900
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
38101–38115
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1900/
Cite (ACL):
Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, and Jiaqi Wang. 2026. Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38101–38115, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better (Wang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1900.pdf
Checklist:
 2026.findings-acl.1900.checklist.pdf