AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

Xiao Wang, Qingyi Si, Shiyu Zhu, Jianlong Wu, Li Cao, Liqiang Nie


Abstract
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose **AdaReTaKe**, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench.
Anthology ID:
2025.findings-acl.283
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5417–5432
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2025.findings-acl.283/
DOI:
10.18653/v1/2025.findings-acl.283
Cite (ACL):
Xiao Wang, Qingyi Si, Shiyu Zhu, Jianlong Wu, Li Cao, and Liqiang Nie. 2025. AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 5417–5432, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding (Wang et al., Findings 2025)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2025.findings-acl.283.pdf