Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning
Ruizhe Chen | Tianze Luo | Zhiting Fan | Heqing Zou | Zhaopeng Feng | Guiyang Xie | Hansheng Zhang | Zhuochen Wang | Zuozhu Liu | Zhang Huaijian
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning (SFT) with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold-start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold-start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.
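As a rough illustration of the recipe the abstract describes (SFT on curated cold-start data, then difficulty-controlled RL), the sketch below shows one plausible way to implement two ingredients common to such pipelines: a temporal-IoU reward for the RL stage and a difficulty filter that keeps queries a reference (SFT) model finds neither trivial nor hopeless. All names, thresholds, and the choice of IoU as the reward are assumptions made for illustration, not the authors' released code.

```python
# Hypothetical sketch of the two-stage VTG recipe: SFT initialization,
# then difficulty-controlled RL. Thresholds, dataclass fields, and the
# IoU-based reward are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VTGSample:
    video_id: str
    query: str
    gt_span: Tuple[float, float]    # ground-truth (start, end) in seconds
    ref_span: Tuple[float, float]   # span predicted by the SFT reference model


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_rl_pool(samples: List[VTGSample],
                   low: float = 0.1, high: float = 0.7) -> List[VTGSample]:
    """Difficulty-controlled selection: keep samples the reference model
    neither solves near-perfectly (IoU >= high) nor misses entirely (IoU < low)."""
    return [s for s in samples
            if low <= temporal_iou(s.ref_span, s.gt_span) < high]


def span_reward(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU-based reward signal for the RL stage (one common choice for VTG)."""
    return temporal_iou(pred, gt)
```

In this sketch, samples the SFT reference model already solves almost perfectly are dropped because they carry little RL learning signal, and samples with near-zero IoU are dropped as likely noise or out of reach; the released datasets and code should be consulted for the actual curation criteria and reward design.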