Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference

Siyuan Wang (王思远); Dianyi Wang; Chengxing Zhou; Zejun Li; Zhihao Fan; Xuan-Jing Huang (黄萱菁); Zhongyu Wei (魏忠钰)

Abstract

Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, which updates both a projector and the LLM backbone. Inspired by the concept of a visual region in the human brain, we investigate the existence of an analogous visual region within LLMs that functions as a cognitive core, and explore the potential of efficient training of LVLMs via selective layer tuning. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance and maintain or improve textual task results, while effectively reducing training time. Based on this targeted training approach, we further propose a novel visual region-based pruning paradigm that removes non-critical layers outside the visual region with minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which proves consistently effective across different models.
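The selective layer tuning described in the abstract can be illustrated with a minimal sketch. It assumes a 32-layer backbone and assumes that "sparsely and uniformly distributed" can be approximated by evenly spaced layer indices; the paper's actual selection rule may differ. The function names `uniform_layer_indices` and `trainable_mask` are hypothetical and used only for illustration.

```python
def uniform_layer_indices(num_layers: int, fraction: float = 0.25) -> list[int]:
    """Pick roughly `fraction` of the layer indices, evenly spaced over the stack.

    Assumption: even spacing stands in for the "sparse and uniform"
    distribution mentioned in the abstract.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(num_layers * fraction))
    # Integer arithmetic: take the first layer of each of k equal-width bins.
    return [(i * num_layers) // k for i in range(k)]


def trainable_mask(num_layers: int, selected: list[int]) -> list[bool]:
    """True for layers that would be updated, False for layers kept frozen."""
    chosen = set(selected)
    return [i in chosen for i in range(num_layers)]


# Hypothetical 32-layer backbone with 25% of layers selected for tuning.
selected = uniform_layer_indices(32, 0.25)
mask = trainable_mask(32, selected)
```

In a training framework, a mask like this would typically decide which layers keep gradients enabled (for example via `requires_grad` in PyTorch) while the rest stay frozen. Under the pruning paradigm in the abstract, the unselected layers are the candidates for removal, although which of them count as non-critical is determined by the paper and not by this sketch.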

Anthology ID: 2025.acl-long.1484
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 30715–30727
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1484/
Cite (ACL): Siyuan Wang, Dianyi Wang, Chengxing Zhou, Zejun Li, Zhihao Fan, Xuanjing Huang, and Zhongyu Wei. 2025. Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30715–30727, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference (Wang et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1484.pdf
