@inproceedings{zhang-etal-2025-sharper,
    title = "Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding",
    author = "Zhang, Daoze  and
      Zhao, Yuze  and
      Huang, Jintao  and
      Chen, Yingda",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.acl-long.222/",
    doi = "10.18653/v1/2025.acl-long.222",
    pages = "4423--4439",
    ISBN = "979-8-89176-251-0",
    abstract = "Despite existing multimodal language models showing impressive performance on the video understanding task, extremely long videos still pose significant challenges to language model{'}s context length, memory consumption, and computational complexity. To address these issues, we propose a vision-language model named Sophia for long video understanding, which can efficiently handle hour-scale long videos. First, we employ a Shot-adaptive Frame Pruning technique, which naturally segments long videos into multiple camera shots, to more sharply identify and focus on the frames relevant to the query. Additionally, we introduce a Hierarchical Attention mechanism to effectively model the long-term temporal dependencies between video frames, which achieves a time and space complexity of O(N) w.r.t. the input sequence length N while theoretically maintaining the global modeling efficiency. Experimentally, our Sophia exhibits competitive performance compared to existing video understanding baselines across various benchmarks for long video understanding with reduced time and memory consumption. The model code and weights are available at https://huggingface.co/Tao-tse/Sophia."
}