TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh, Umer Ahmed, Hamza Khan, Zeeshan Zia, Quoc-Huy Tran
Abstract
We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder for mapping a long-term video into features which are time-aware and contain both local and global cues. It first divides an input video into short-term clips, which are jointly encoded with timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. Moreover, to facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, i.e., dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
- Anthology ID:
- 2026.findings-acl.70
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1427–1447
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.70/
- Cite (ACL):
- Fawad Javed Fateh, Umer Ahmed, Hamza Khan, Zeeshan Zia, and Quoc-Huy Tran. 2026. TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1427–1447, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos (Fateh et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.70.pdf
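
The abstract describes dividing an input video into short-term clips that carry timestamps and fusing them across overlapping temporal windows. The following is a minimal sketch of that bookkeeping step only, not the authors' implementation: the clip length, window size, stride, and the names `Clip`, `split_into_clips`, and `overlapping_windows` are all hypothetical, since the abstract does not specify them. Feature encoding and the BiLSTM aggregation are omitted.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """A short-term clip with its timestamps (hypothetical structure)."""
    index: int
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds


def split_into_clips(duration_s: float, clip_len_s: float) -> List[Clip]:
    """Divide a video of the given duration into consecutive clips.

    The last clip is shortened if the duration is not a multiple of
    clip_len_s. Both parameters are illustrative assumptions.
    """
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    n_clips = math.ceil(duration_s / clip_len_s)
    clips = []
    for i in range(n_clips):
        start = i * clip_len_s
        end = min(start + clip_len_s, duration_s)
        clips.append(Clip(i, start, end))
    return clips


def overlapping_windows(clips: List[Clip], window: int, stride: int) -> List[List[Clip]]:
    """Group clips into windows of `window` clips, advancing by `stride`.

    Choosing stride < window makes consecutive windows share clips, which
    is the overlap assumed here for fusing local features.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    for first in range(0, len(clips), stride):
        windows.append(clips[first:first + window])
        if first + window >= len(clips):
            break  # the final clip is already covered
    return windows


if __name__ == "__main__":
    clips = split_into_clips(duration_s=10.0, clip_len_s=2.0)
    for w in overlapping_windows(clips, window=3, stride=2):
        print([(c.start_s, c.end_s) for c in w])
```

In this sketch each window would be the unit over which clip features and their timestamps are fused into one local feature, and the resulting sequence of local features would then be passed to a sequence model such as the BiLSTM named in the abstract.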