VideoPro: Adaptive Program Reasoning for Long Video Understanding

Chenglin Li; Feng Han; Yikun Wang; Ruilin Li; Shuai Dong; Haowen Hou; Haitao Li; Qianglong Chen; Feng Tao; Jingqi Tong; Yin Zhang; Jiaqi Wang

Abstract

Understanding long videos remains challenging due to the sparsity of visual evidence relevant to a given query. Prior work has explored program-based visual grounding, typically relying on executable programs generated by auxiliary large language models. However, when scaling to long videos, existing approaches face several critical limitations: (1) frame-centric vision modules are often insufficient for long video processing; (2) naively applying program-based reasoning to all queries incurs considerable computational overhead; and (3) errors arising from low-confidence predictions and imperfect program execution are difficult to recover from. To address these challenges, we propose VideoPro, a unified framework that enables VideoLLMs to adaptively reason over long videos and refine their predictions through executable programs. VideoPro first performs adaptive reasoning, dynamically determining whether a query can be resolved directly by the native VideoLLM or requires explicit multi-step program reasoning. For complex queries, the model decomposes the task into executable programs that invoke specialized vision modules for precise temporal and semantic grounding. To further improve robustness, VideoPro incorporates a self-refinement mechanism that leverages execution feedback and confidence signals to correct erroneous executions and refine low-confidence reasoning programs. By tightly integrating adaptive reasoning with self-refinement, VideoPro consistently outperforms prior methods across multiple long-video understanding benchmarks, yielding an average 6.7% improvement for Qwen3-VL-8B.
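The control flow described in the abstract (answer directly when possible, otherwise generate and execute a program, then refine using execution feedback and confidence signals) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how routing is decided, so the confidence threshold used here is an assumption, and every name (`Answer`, `adaptive_answer`, `direct_answer`, `generate_program`, `run_program`, `threshold`, `max_refinements`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def adaptive_answer(
    query: str,
    direct_answer: Callable[[str], Answer],                 # hypothetical VideoLLM call
    generate_program: Callable[[str, Optional[str]], str],  # hypothetical program writer
    run_program: Callable[[str], Answer],                   # hypothetical executor
    threshold: float = 0.7,                                 # assumed routing threshold
    max_refinements: int = 2,
) -> Tuple[Answer, str]:
    """Return (answer, path), where path is 'direct', 'program', or 'fallback'."""
    # Adaptive reasoning: keep the direct answer when it is confident enough.
    direct = direct_answer(query)
    if direct.confidence >= threshold:
        return direct, "direct"

    best = direct
    feedback: Optional[str] = None
    # Program reasoning with self-refinement: each retry sees the last feedback.
    for _ in range(max_refinements + 1):
        program = generate_program(query, feedback)
        try:
            result = run_program(program)
        except Exception as exc:
            # Execution feedback drives the next program revision.
            feedback = f"execution error: {exc}"
            continue
        if result.confidence > best.confidence:
            best = result
        if result.confidence >= threshold:
            return result, "program"
        # Confidence signal drives the next program revision.
        feedback = f"low confidence: {result.confidence:.2f}"

    # No attempt was confident; return the most confident answer seen.
    return best, "fallback"
```

Passing the model, program generator, and executor as callables keeps the sketch independent of any particular VideoLLM or vision module; in practice these would wrap the components the paper describes.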

Anthology ID: 2026.acl-long.341
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7497–7513
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.341/
Cite (ACL): Chenglin Li, Feng Han, Yikun Wang, Ruilin Li, Shuai Dong, Haowen Hou, Haitao Li, Qianglong Chen, Feng Tao, Jingqi Tong, Yin Zhang, and Jiaqi Wang. 2026. VideoPro: Adaptive Program Reasoning for Long Video Understanding. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7497–7513, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): VideoPro: Adaptive Program Reasoning for Long Video Understanding (Li et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.341.pdf
Checklist: 2026.acl-long.341.checklist.pdf
