Exposing the Limits of Video-Text Models through Contrast Sets

Jae Sung Park; Sheng Shen; Ali Farhadi; Trevor Darrell; Yejin Choi; Anna Rohrbach

doi:10.18653/v1/2022.naacl-main.261

Abstract

Recent video-text models can retrieve relevant videos based on text with high accuracy, but to what extent do they comprehend the semantics of the text? Can they discriminate between similar entities and actions? To answer this, we propose an evaluation framework that probes video-text models with hard negatives. We automatically build contrast sets, where true textual descriptions are manipulated in ways that change their semantics while maintaining plausibility. Specifically, we leverage a pre-trained language model and a set of heuristics to create verb- and person-entity-focused contrast sets. We apply these in the multiple-choice video-to-text classification setting. We test the robustness of recent methods on the proposed automatic contrast sets, and compare them to additionally collected human-generated counterparts, to assess their effectiveness. We see that model performance suffers across all methods, erasing the gap between recent CLIP-based methods and earlier methods.
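
The multiple-choice setting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `make_verb_contrasts` and `multiple_choice_accuracy` are hypothetical, the alternative verbs are supplied by hand as a stand-in for proposals from a pre-trained language model, and the similarity scores are assumed to come from some video-text model that is not shown here.

```python
from typing import List, Sequence


def make_verb_contrasts(caption: str, verb: str, alternatives: Sequence[str]) -> List[str]:
    """Build hard negatives by swapping `verb` for each alternative.

    Assumption: `alternatives` stands in for verbs that a language model
    would propose; no plausibility filtering is applied in this sketch.
    """
    words = caption.split()
    if verb not in words:
        raise ValueError(f"verb {verb!r} not found in caption")
    return [
        " ".join(alt if w == verb else w for w in words)
        for alt in alternatives
        if alt != verb
    ]


def multiple_choice_accuracy(scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of videos whose true caption receives the highest score.

    scores[i][j] is the (assumed) similarity between video i and its
    j-th candidate text; gold[i] is the index of the true caption.
    """
    if len(scores) != len(gold) or not scores:
        raise ValueError("scores and gold must be non-empty and the same length")
    correct = 0
    for row, answer in zip(scores, gold):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == answer)
    return correct / len(scores)


if __name__ == "__main__":
    negatives = make_verb_contrasts("a man opens the door", "opens", ["closes", "paints"])
    print(negatives)
    # Two videos, three candidates each; the true caption is at index 0.
    print(multiple_choice_accuracy([[0.9, 0.8, 0.1], [0.2, 0.7, 0.3]], [0, 0]))
```

In this toy setup a model is counted as correct only when it ranks the original description above every manipulated one, which is the sense in which contrast sets act as hard negatives.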

Anthology ID: 2022.naacl-main.261
Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month: July
Year: 2022
Address: Seattle, United States
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 3574–3586
URL: https://aclanthology.org/2022.naacl-main.261
DOI: 10.18653/v1/2022.naacl-main.261
Cite (ACL): Jae Sung Park, Sheng Shen, Ali Farhadi, Trevor Darrell, Yejin Choi, and Anna Rohrbach. 2022. Exposing the Limits of Video-Text Models through Contrast Sets. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3574–3586, Seattle, United States. Association for Computational Linguistics.
Cite (Informal): Exposing the Limits of Video-Text Models through Contrast Sets (Park et al., NAACL 2022)
PDF: https://preview.aclanthology.org/starsem-semeval-split/2022.naacl-main.261.pdf
Video: https://preview.aclanthology.org/starsem-semeval-split/2022.naacl-main.261.mp4
Code: jamespark3922/video-lang-contrast-set
