Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
Abstract
Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and fine- tuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Building on observations about the data scaling, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by 2.4% in accuracy using only 3.6% training samples. Specifically, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our pure RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS’s strong reasoning performance.- Anthology ID:
- 2025.emnlp-main.1428
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 28114–28128
- Language:
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1428/
- DOI:
- Cite (ACL):
- Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. 2025. Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28114–28128, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning (Wang et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1428.pdf