ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Fangxu Yu, Ziyao Lu, Liqiang Niu, Fandong Meng, Jie Zhou


Abstract
Grounding events in videos serves as a fundamental capability in video analysis. While Vision Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
Anthology ID:
2026.findings-acl.1730
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
34657–34671
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1730/
DOI:
Bibkey:
Cite (ACL):
Fangxu Yu, Ziyao Lu, Liqiang Niu, Fandong Meng, and Jie Zhou. 2026. ArrowGEV: Grounding Events in Video via Learning the Arrow of Time. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34657–34671, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
ArrowGEV: Grounding Events in Video via Learning the Arrow of Time (Yu et al., Findings 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1730.pdf
Checklist:
 2026.findings-acl.1730.checklist.pdf