MoPrune: Scene-Guided Motion-Aware Token Pruning for Efficient Video Large Language Models

Wenhao Hong, Ziyang Wang, Yixin Zhang, Zilei Wang


Abstract
Video Large Language Models (VideoLLMs) struggle with the heavy computational cost of long or high-resolution videos due to massive visual token counts and the quadratic complexity of attention. Prior pruning approaches mainly rely on token importance or similarity, while largely overlooking video dynamics and the fact that different scenes exhibit different redundancy patterns. We introduce MoPrune, a training-free, scene-guided and motion-centric token pruning framework for accelerating VideoLLMs. MoPrune first segments videos into semantically coherent scenes to preserve temporal and motion consistency. Within each scene, it determines frame retention rates from intra-scene frame uniqueness. Finally, at the token level, MoPrune retains visually distinctive tokens and motion-salient tokens via a unified score, preserving both informative static details and dynamic regions. Extensive experiments across multiple VideoLLMs and public benchmarks demonstrate MoPrune’s superior efficiency–performance trade-offs. On LLaVA-OneVision, retaining 25% of the visual tokens matches or slightly improves on the dense baseline, and retaining 15% of the tokens preserves 99% of the original performance. MoPrune is fully compatible with hardware-efficient techniques such as Flash Attention.
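The three stages named in the abstract (scene segmentation, per-frame retention rates, unified token scoring) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration only: the cosine-similarity scene boundary rule, the `threshold` value, uniqueness defined as distance from the scene mean, the `alpha` weighting between distinctiveness and motion, and all function names. It should not be read as the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def segment_scenes(frame_feats, threshold=0.8):
    """Assumed rule: open a new scene when consecutive frame features
    have cosine similarity below `threshold`."""
    scenes = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            scenes.append([i])
        else:
            scenes[-1].append(i)
    return scenes

def frame_weights(frame_feats, scene, eps=1e-6):
    """Assumed uniqueness: 1 - similarity to the scene mean, smoothed by `eps`
    so that identical frames still receive a positive share."""
    center = mean_vec([frame_feats[i] for i in scene])
    return {i: max(0.0, 1.0 - cosine(frame_feats[i], center)) + eps for i in scene}

def token_scores(tokens, prev_tokens, alpha=0.5):
    """Assumed unified score: alpha * distinctiveness + (1 - alpha) * motion.
    Motion compares each token with the same position in the previous frame."""
    center = mean_vec(tokens)
    scores = []
    for j, tok in enumerate(tokens):
        distinct = 1.0 - cosine(tok, center)
        motion = 0.0 if prev_tokens is None else 1.0 - cosine(tok, prev_tokens[j])
        scores.append(alpha * distinct + (1.0 - alpha) * motion)
    return scores

def prune(video, keep_ratio=0.25, threshold=0.8, alpha=0.5):
    """video: list of frames, each a list of token vectors (same count per frame).
    Returns {frame index: sorted indices of retained tokens}."""
    frame_feats = [mean_vec(frame) for frame in video]
    kept = {}
    for scene in segment_scenes(frame_feats, threshold):
        weights = frame_weights(frame_feats, scene)
        total = sum(weights.values())
        budget = keep_ratio * sum(len(video[i]) for i in scene)
        for i in scene:
            # Split the scene's token budget in proportion to frame uniqueness,
            # keeping at least one token per frame.
            k = max(1, min(len(video[i]), round(budget * weights[i] / total)))
            prev = video[i - 1] if i > scene[0] else None
            scores = token_scores(video[i], prev, alpha)
            top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
            kept[i] = sorted(top)
    return kept
```

Because of rounding and the one-token-per-frame floor, the number of retained tokens in this sketch only approximates `keep_ratio`. The selection yields plain index lists and does not touch attention weights, which is one way a pruning step can remain usable alongside fused attention kernels.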
Anthology ID: 2026.findings-acl.344
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6929–6941
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.344/
Cite (ACL): Wenhao Hong, Ziyang Wang, Yixin Zhang, and Zilei Wang. 2026. MoPrune: Scene-Guided Motion-Aware Token Pruning for Efficient Video Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6929–6941, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): MoPrune: Scene-Guided Motion-Aware Token Pruning for Efficient Video Large Language Models (Hong et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.344.pdf
Checklist: 2026.findings-acl.344.checklist.pdf