iMOVE: Instance-Motion-Aware Video Understanding

Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Yandong.bai Yandong.bai, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, Di Zhang


Abstract
Enhancing the fine-grained instance spatiotemporal motion perception capabilities of Video Large Language Models is crucial for improving their temporal and general video understanding. However, current models struggle to perceive detailed and complex instance motions. To address this, we make improvements on both the data side and the model side. On the data side, we curate iMOVE-IT, the first large-scale instance-motion-aware video instruction-tuning dataset. This dataset is enriched with comprehensive instance motion annotations and spatiotemporal mutual-supervision tasks, providing extensive training for the model’s instance-motion awareness. Building on this foundation, we introduce iMOVE, an instance-motion-aware video foundation model that uses Event-aware Spatiotemporal Efficient Modeling to retain informative instance spatiotemporal motion details while maintaining computational efficiency. It also incorporates Relative Spatiotemporal Position Tokens to ensure awareness of instance spatiotemporal positions. Evaluations indicate that iMOVE not only excels in video temporal understanding and general video understanding but also demonstrates significant advantages in long-term video understanding. We will release the data, code, and model weights after acceptance.
Anthology ID:
2025.findings-acl.1228
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
23959–23975
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1228/
Cite (ACL):
Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Yandong.bai Yandong.bai, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, and Di Zhang. 2025. iMOVE: Instance-Motion-Aware Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23959–23975, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
iMOVE: Instance-Motion-Aware Video Understanding (Li et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1228.pdf