PruneVid: Visual Token Pruning for Efficient Video Large Language Models

Xiaohu Huang, Hao Zhou, Kai Han


Abstract
We introduce PruneVid, a training-free visual token pruning method designed to enhance the efficiency of multimodal video understanding. While Large Language Models (LLMs) have shown promising performance on video tasks due to their advanced visual comprehension capabilities, the substantial redundancy inherent in video data poses significant computational challenges. To address this issue, PruneVid (1) reduces intrinsic video redundancy by merging temporally static and spatially similar tokens, and (2) leverages LLMs’ inherent ability to selectively prune visual tokens irrelevant to specific queries, thereby improving model efficiency. We validate our method across multiple video benchmarks, demonstrating that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different video LLMs. Our results highlight PruneVid’s superior effectiveness and efficiency compared to existing pruning methods.
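The abstract describes two mechanisms: merging temporally static and spatially similar visual tokens, and then keeping only the tokens the text query attends to. The sketch below is a minimal, illustrative rendering of that two-stage idea under simplifying assumptions, not the authors' implementation; the function names, the cosine-similarity threshold, and the keep ratio are hypothetical.

```python
import torch
import torch.nn.functional as F

def merge_static_tokens(video_tokens: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Collapse temporally static runs of each spatial patch into a single averaged token.

    video_tokens: (T, N, D) -- T frames, N spatial tokens per frame, D feature dims.
    Returns a flat (M, D) tensor with M <= T * N. (Illustrative; threshold is an assumption.)
    """
    T, N, D = video_tokens.shape
    kept = []
    for n in range(N):                                   # trajectory of one spatial patch over time
        track = video_tokens[:, n, :]                    # (T, D)
        sims = F.cosine_similarity(track[:-1], track[1:], dim=-1)  # (T-1,) frame-to-frame similarity
        start = 0
        for t in range(1, T):
            if sims[t - 1] < sim_threshold:              # patch changed noticeably: close the run
                kept.append(track[start:t].mean(dim=0))
                start = t
        kept.append(track[start:].mean(dim=0))           # flush the final run
    return torch.stack(kept)

def prune_by_query_attention(visual_tokens: torch.Tensor,
                             query_tokens: torch.Tensor,
                             keep_ratio: float = 0.2) -> torch.Tensor:
    """Keep only the visual tokens most attended to by the text query tokens.

    visual_tokens: (M, D), query_tokens: (Q, D). keep_ratio is an assumption.
    """
    attn = (query_tokens @ visual_tokens.T) / visual_tokens.shape[-1] ** 0.5
    scores = attn.softmax(dim=-1).mean(dim=0)            # average attention over query tokens, (M,)
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    top_idx = scores.topk(k).indices.sort().values       # preserve original token order
    return visual_tokens[top_idx]

if __name__ == "__main__":
    T, N, D, Q = 8, 16, 64, 4
    video = torch.randn(T, N, D)
    query = torch.randn(Q, D)
    merged = merge_static_tokens(video)
    pruned = prune_by_query_attention(merged, query)
    print(f"{T * N} tokens -> {merged.shape[0]} after merging -> {pruned.shape[0]} after pruning")
```

In the paper, the query-relevance signal comes from the LLM's own attention rather than a separate dot-product scoring as sketched here; the example only conveys the shape of the pipeline: reduce video redundancy first, then drop query-irrelevant tokens.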
Anthology ID: 2025.findings-acl.1024
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19959–19973
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1024/
Cite (ACL): Xiaohu Huang, Hao Zhou, and Kai Han. 2025. PruneVid: Visual Token Pruning for Efficient Video Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19959–19973, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): PruneVid: Visual Token Pruning for Efficient Video Large Language Models (Huang et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1024.pdf