Less Descriptive yet Discriminative: Quantifying the Properties of Multimodal Referring Utterances via CLIP

Ece Takmaz, Sandro Pezzelle, Raquel Fernández


Abstract
In this work, we use a transformer-based pre-trained multimodal model, CLIP, to shed light on the mechanisms employed by human speakers when referring to visual entities. In particular, we use CLIP to quantify the degree of descriptiveness (how well an utterance describes an image in isolation) and discriminativeness (to what extent an utterance is effective in picking out a single image among similar images) of human referring utterances within multimodal dialogues. Overall, our results show that utterances become less descriptive over time while their discriminativeness remains unchanged. Through analysis, we propose that this trend could be due to participants relying on previous mentions in the dialogue history, as well as being able to distill the most discriminative information from the visual context. In general, our study opens up the possibility of using this and similar models to quantify patterns in human data and shed light on the underlying cognitive mechanisms.
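For intuition, below is a minimal sketch of how such scores can be computed with CLIP, assuming the Hugging Face transformers implementation and the openai/clip-vit-base-patch32 checkpoint. The utterance and image paths are hypothetical, and the two scores shown here only illustrate the general idea (raw similarity in isolation vs. normalized similarity over a candidate set); the paper's exact procedure is in the linked code repository.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical referring utterance and candidate images (target first, then distractors).
utterance = "the brown dog catching a frisbee"
images = [Image.open(p) for p in ["target.jpg", "distractor1.jpg", "distractor2.jpg"]]

inputs = processor(text=[utterance], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text.squeeze(0)  # one similarity score per image

# Descriptiveness: utterance-image similarity for the target image, taken in isolation.
descriptiveness = logits[0].item()

# Discriminativeness: how strongly the utterance singles out the target among
# visually similar candidates (softmax probability assigned to the target).
discriminativeness = torch.softmax(logits, dim=0)[0].item()

print(f"descriptiveness={descriptiveness:.3f}, discriminativeness={discriminativeness:.3f}")

Under this framing, an utterance can score low on descriptiveness yet high on discriminativeness: it need not describe the target well in absolute terms, as long as it fits the target better than the distractors.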
Anthology ID: 2022.cmcl-1.4
Volume: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Emmanuele Chersoni, Nora Hollenstein, Cassandra Jacobs, Yohei Oseki, Laurent Prévot, Enrico Santus
Venue: CMCL
Publisher: Association for Computational Linguistics
Pages: 36–42
URL: https://aclanthology.org/2022.cmcl-1.4
DOI: 10.18653/v1/2022.cmcl-1.4
Cite (ACL): Ece Takmaz, Sandro Pezzelle, and Raquel Fernández. 2022. Less Descriptive yet Discriminative: Quantifying the Properties of Multimodal Referring Utterances via CLIP. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 36–42, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Less Descriptive yet Discriminative: Quantifying the Properties of Multimodal Referring Utterances via CLIP (Takmaz et al., CMCL 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-2/2022.cmcl-1.4.pdf
Video: https://preview.aclanthology.org/nschneid-patch-2/2022.cmcl-1.4.mp4
Code: ecekt/clip-desc-disc