@inproceedings{devaraj-etal-2024-diving,
    title = "Diving Deep into the Motion Representation of Video-Text Models",
    author = "Devaraj, Chinmaya  and
      Fermuller, Cornelia  and
      Aloimonos, Yiannis",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.findings-acl.747/",
    doi = "10.18653/v1/2024.findings-acl.747",
    pages = "12575--12584",
    abstract = "Videos are more informative than images becausethey capture the dynamics of the scene.By representing motion in videos, we can capturedynamic activities. In this work, we introduceGPT-4 generated motion descriptions thatcapture fine-grained motion descriptions of activitiesand apply them to three action datasets.We evaluated several video-text models on thetask of retrieval of motion descriptions. Wefound that they fall far behind human expertperformance on two action datasets, raisingthe question of whether video-text models understandmotion in videos. To address it, weintroduce a method of improving motion understandingin video-text models by utilizingmotion descriptions. This method proves tobe effective on two action datasets for the motiondescription retrieval task. The results drawattention to the need for quality captions involvingfine-grained motion information in existingdatasets and demonstrate the effectiveness ofthe proposed pipeline in understanding finegrainedmotion during video-text retrieval."
}