@inproceedings{zhang-etal-2025-sharper,
    title = "Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding",
    author = "Zhang, Daoze  and
      Zhao, Yuze  and
      Huang, Jintao  and
      Chen, Yingda",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.acl-long.222/",
    doi = "10.18653/v1/2025.acl-long.222",
    pages = "4423--4439",
    ISBN = "979-8-89176-251-0",
    abstract = "Despite existing multimodal language models showing impressive performance on the video understanding task, extremely long videos still pose significant challenges to language model{'}s context length, memory consumption, and computational complexity. To address these issues, we propose a vision-language model named Sophia for long video understanding, which can efficiently handle hour-scale long videos. First, we employ a Shot-adaptive Frame Pruning technique, which naturally segments long videos into multiple camera shots, to more sharply identify and focus on the frames relevant to the query. Additionally, we introduce a Hierarchical Attention mechanism to effectively model the long-term temporal dependencies between video frames, which achieves a time and space complexity of O(N) w.r.t. the input sequence length N while theoretically maintaining the global modeling efficiency. Experimentally, our Sophia exhibits competitive performance compared to existing video understanding baselines across various benchmarks for long video understanding with reduced time and memory consumption. The model code and weights are available at https://huggingface.co/Tao-tse/Sophia."
}