AdaV: Adaptive Text-visual Redirection for Vision-Language Models
Jiayi Han | Liang Du | Yiwen Wu | Guanming Liang | Xiangguo Zhou | Weibo Zheng | Donghong Han | Zixun Sun
Findings of the Association for Computational Linguistics: ACL 2025
The success of Vision-Language Models (VLMs) often relies on high-resolution schemes that preserve image details, but these schemes also generate an excess of visual tokens, leading to a substantial decrease in model efficiency. A typical VLM comprises a visual encoder, a text encoder, and an LLM. Recent studies suggest pruning visual tokens based on visual and textual priors to accelerate VLMs without additional training costs. However, these methods often overlook prompt semantics or suffer from biased self-attention in the LLM. Inspired by the efficient mechanisms of the human brain for multimodal understanding, we introduce AdaV, a novel training-free visual token pruning method. By emulating the neural pathways that preprocess visual and auditory information before the reasoning stage, we shift text-guided visual attention redirection to the pre-LLM stage, which reduces biased token pruning and enhances model robustness under a limited visual token budget. We further propose a Self-adaptive Cross-modality Attention Redirection (SCAR) module that effectively merges and redirects visual attention with text-to-image attention. Extensive experiments on seven challenging benchmarks demonstrate that AdaV achieves state-of-the-art performance in training-free VLM acceleration and can be applied plug-and-play to various VLMs. We plan to open-source the code upon publication.
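For intuition, the sketch below illustrates the general idea of pre-LLM, text-guided visual token pruning: merging a visual-saliency cue with text-to-image relevance and keeping only a budgeted subset of tokens. This is a minimal illustration, not the paper's SCAR module; the function name, the inputs `visual_attn` and `text_embed`, and the mixing weight `alpha` are assumptions made for the example.

```python
import torch

def prune_visual_tokens(visual_tokens, visual_attn, text_embed,
                        keep_ratio=0.25, alpha=0.5):
    """Hypothetical sketch of text-guided visual token pruning before the LLM.

    visual_tokens: (N, D) patch features from the vision encoder
    visual_attn:   (N,)   visual saliency scores, e.g. [CLS]-to-patch attention
    text_embed:    (D,)   pooled prompt embedding projected into the visual space
    """
    # Text-to-image relevance: similarity between each patch and the prompt.
    t2i = torch.cosine_similarity(visual_tokens, text_embed.unsqueeze(0), dim=-1)

    # Merge the two cues; alpha balances visual saliency vs. prompt relevance.
    score = alpha * torch.softmax(visual_attn, dim=0) \
          + (1 - alpha) * torch.softmax(t2i, dim=0)

    # Keep the top-k tokens under the visual-token budget, preserving order.
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep_idx = score.topk(k).indices.sort().values
    return visual_tokens[keep_idx], keep_idx
```

Because the selection happens before the tokens reach the LLM, the pruning decision is driven by vision-encoder features and prompt relevance rather than by the LLM's own self-attention, which is the bias the abstract describes.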