ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning
Rui Wang, Bohao Li, Xiyang Dai, Jianwei Yang, Yi-Ling Chen, Zhen Xing, Yifan Yang, Dongdong Chen, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Abstract
Video understanding is essential for multimodal large language models (MLLMs) to interact effectively with users and the real world. However, analyzing long videos remains a major challenge due to the lack of high-quality video instruction data and effective training strategies. In this paper, we introduce a simple yet effective baseline for long-context video understanding, including dataset construction and training recipes. We curate a large-scale video instruction dataset with over 1M samples, encompassing videos from a few seconds to several minutes across diverse sources, without any human annotations. Additionally, we propose a progressive video instruction tuning strategy that incrementally increases input context length, enabling better utilization of videos of varying durations. Comprehensive experiments demonstrate that our dataset significantly outperforms existing video instruction datasets for fine-tuning MLLMs. Furthermore, our training approach establishes a strong video MLLM baseline, surpassing previous open-source models on video benchmarks and outperforming proprietary models like GPT-4V and GPT-4o-mini on VideoMME, even with a compact 7B model.
- Anthology ID:
- 2025.emnlp-main.1570
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 30824–30837
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1570/
- Cite (ACL):
- Rui Wang, Bohao Li, Xiyang Dai, Jianwei Yang, Yi-Ling Chen, Zhen Xing, Yifan Yang, Dongdong Chen, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. 2025. ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30824–30837, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning (Wang et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1570.pdf
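To make the abstract's training recipe concrete, the sketch below illustrates one way a progressive context-length schedule could be organized: train first on short clips, then extend to longer videos. This is a minimal, hypothetical illustration; the stage boundaries, frame budgets, and the `dataset.loader` helper are assumptions for exposition, not the paper's actual configuration.

```python
# Minimal sketch of progressive video instruction tuning: the input context
# (here, frames sampled per video) grows stage by stage. All numbers and the
# dataset.loader(...) helper are illustrative assumptions, not the paper's recipe.
from dataclasses import dataclass

@dataclass
class Stage:
    max_frames: int  # frames per video; proxy for input context length
    num_steps: int   # optimizer steps spent at this context length

SCHEDULE = [
    Stage(max_frames=8, num_steps=20_000),    # short clips (a few seconds)
    Stage(max_frames=32, num_steps=10_000),   # medium-length videos
    Stage(max_frames=128, num_steps=5_000),   # long videos (several minutes)
]

def train(model, dataset, optimizer):
    for stage in SCHEDULE:
        # Re-create the loader so each stage samples up to its frame budget.
        loader = dataset.loader(max_frames=stage.max_frames)  # hypothetical API
        for _, batch in zip(range(stage.num_steps), loader):
            loss = model(frames=batch.frames, text=batch.text).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```

The design intuition behind such a curriculum is that short contexts keep early training cheap and stable, while later stages reuse the same data pipeline and only the sampled context grows.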