TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

Yuanze Hu; Xinyu Wang; Zhichao Yang; Gen Li; Ye Qiu; Zhaoxin Fan; Yifan Sun; Wenjun Wu; Jin Dong; Xiaotie Deng


Abstract

Lightweight Vision-Language Models (VLMs) are indispensable for resource-constrained applications. The prevailing approach to aligning vision and language models involves freezing both the vision encoder and the language model while training small connector modules. However, this strategy heavily depends on the intrinsic capabilities of the language model, which can be suboptimal for lightweight models with limited representational capacity. In this work, we investigate this alignment bottleneck through the lens of mutual information, positing that the constrained capacity of the language model inherently limits the Effective Mutual Information (EMI) between multimodal inputs and outputs, thereby compromising alignment quality. To address this challenge, we propose TinyAlign, a novel framework inspired by Retrieval-Augmented Generation, which strategically retrieves relevant context from a memory bank constructed from training data to enrich multimodal inputs and enhance their alignment. Extensive empirical evaluations reveal that TinyAlign significantly reduces training loss, accelerates convergence, and enhances task performance with negligible computational overhead. Remarkably, it allows models to achieve baseline-level performance with only 40% of the fine-tuning data, highlighting exceptional data efficiency. Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.
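The abstract describes retrieving relevant context from a memory bank to enrich multimodal inputs, but it does not specify how retrieval works. The following is a minimal sketch of one common way such a lookup could be done, assuming the memory bank stores (key embedding, context) pairs and that retrieval uses cosine-similarity top-k search. The names `retrieve_top_k`, `enrich`, and `memory_bank`, the toy vectors, and the choice of cosine similarity are all illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, memory_bank, k=2):
    """Return the k (key, context) entries whose keys are most similar to the query.

    Hypothetical helper: the paper's actual retrieval rule is not given in the abstract.
    """
    ranked = sorted(memory_bank, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return ranked[:k]


def enrich(query, memory_bank, k=2):
    """Pair a query embedding with its retrieved contexts (assumed input format)."""
    contexts = [context for _, context in retrieve_top_k(query, memory_bank, k)]
    return {"query": query, "context": contexts}


# Toy memory bank; in the paper it is constructed from training data.
memory_bank = [
    ([1.0, 0.0, 0.0], "ctx_a"),
    ([0.0, 1.0, 0.0], "ctx_b"),
    ([0.9, 0.1, 0.0], "ctx_c"),
]

result = enrich([1.0, 0.05, 0.0], memory_bank, k=2)
print(result["context"])
```

In this sketch the retrieved contexts are simply returned alongside the query; how they are fused with the visual and textual inputs is left open, since the abstract does not describe that step.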

Anthology ID: 2026.findings-acl.223
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4585–4597
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.223/
Cite (ACL): Yuanze Hu, Xinyu Wang, Zhichao Yang, Gen Li, Ye Qiu, Zhaoxin Fan, Yifan Sun, Wenjun Wu, Jin Dong, and Xiaotie Deng. 2026. TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks. In Findings of the Association for Computational Linguistics: ACL 2026, pages 4585–4597, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks (Hu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.223.pdf
Checklist: 2026.findings-acl.223.checklist.pdf
