On the Effects of Video Grounding on Language Models

Ehsan Doostmohammadi, Marco Kuhlmann


Abstract
Transformer-based models trained on text and vision modalities aim to improve performance on multimodal downstream tasks or to tackle the lack of grounding, e.g., by addressing issues such as models’ insufficient commonsense knowledge. While it is relatively straightforward to evaluate such models on multimodal tasks such as visual question answering or image captioning, it is less well understood how these tasks affect the model itself and its internal linguistic representations. In this work, we experiment with language models grounded in videos and measure the models’ performance on predicting masked words chosen based on their imageability. The results show that the smaller model benefits from video grounding in predicting highly imageable words, while the results for the larger model are harder to interpret.
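The evaluation described in the abstract scores a masked language model on how well it recovers words selected for their imageability. Below is a minimal sketch of that kind of masked-word scoring, not the authors' code: it assumes the Hugging Face `transformers` library and a generic `bert-base-uncased` checkpoint (the paper instead compares video-grounded models), and the example sentence and target word are hypothetical.

```python
# Minimal sketch: probability a masked LM assigns to a highly imageable target word.
# Assumptions: transformers + torch installed; bert-base-uncased stands in for the
# grounded/ungrounded models compared in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

sentence = "She sliced the [MASK] with a sharp knife."  # hypothetical example
target = "apple"                                        # a highly imageable word

inputs = tokenizer(sentence, return_tensors="pt")
# Locate the [MASK] position in the tokenized input.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary at the masked position.
probs = logits[0, mask_pos].softmax(dim=-1)
target_id = tokenizer.convert_tokens_to_ids(target)
print(f"P({target}) = {probs[target_id]:.4f}")
print("Top-5 predictions:",
      tokenizer.convert_ids_to_tokens(probs.topk(5).indices.tolist()))
```

Repeating this scoring over word sets binned by imageability ratings would yield the kind of comparison between grounded and text-only models that the abstract reports.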
Anthology ID:
2022.mmmpie-1.1
Volume:
Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models
Month:
October
Year:
2022
Address:
Virtual
Venue:
MMMPIE
Publisher:
International Conference on Computational Linguistics
Pages:
1–6
URL:
https://aclanthology.org/2022.mmmpie-1.1
Cite (ACL):
Ehsan Doostmohammadi and Marco Kuhlmann. 2022. On the Effects of Video Grounding on Language Models. In Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models, pages 1–6, Virtual. International Conference on Computational Linguistics.
Cite (Informal):
On the Effects of Video Grounding on Language Models (Doostmohammadi & Kuhlmann, MMMPIE 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/2022.mmmpie-1.1.pdf
Data
HowTo100M