Cornelia Fermüller

Also published as: Cornelia Fermuller

2024

Diving Deep into the Motion Representation of Video-Text Models
Chinmaya Devaraj | Cornelia Fermuller | Yiannis Aloimonos
Findings of the Association for Computational Linguistics ACL 2024

Videos are more informative than images because they capture the dynamics of the scene. By representing motion in videos, we can capture dynamic activities. In this work, we introduce GPT-4 generated motion descriptions that capture fine-grained motion descriptions of activities and apply them to three action datasets. We evaluated several video-text models on the task of retrieval of motion descriptions. We found that they fall far behind human expert performance on two action datasets, raising the question of whether video-text models understand motion in videos. To address it, we introduce a method of improving motion understanding in video-text models by utilizing motion descriptions. This method proves to be effective on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions involving fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval.

2015

Learning the Semantics of Manipulation Action
Yezhou Yang | Yiannis Aloimonos | Cornelia Fermüller | Eren Erdal Aksoy
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)