Eleonora Gualdoni

2022

pdf
Horse or pony? Visual typicality and lexical frequency affect variability in object naming
Eleonora Gualdoni | Andreas Madebach | Thomas Brochhagen | Gemma Boleda
Proceedings of the Society for Computation in Linguistics 2022

pdf abs
Communication breakdown: On the low mutual intelligibility between human and neural captioning
Roberto Dessì | Eleonora Gualdoni | Francesca Franzon | Gemma Boleda | Marco Baroni
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We compare the 0-shot performance of a neural caption-based image retriever when given as input either human-produced captions or captions generated by a neural captioner. We conduct this comparison on the recently introduced ImageCoDe data-set (Krojer et al. 2022), which contains hard distractors nearly identical to the images to be retrieved. We find that the neural retriever has much higher performance when fed neural rather than human captions, despite the fact that the former, unlike the latter, were generated without awareness of the distractors that make the task hard. Even more remarkably, when the same neural captions are given to human subjects, their retrieval performance is almost at chance level. Our results thus add to the growing body of evidence that, even when the “language” of neural models resembles English, this superficial resemblance might be deeply misleading.

2020

pdf abs
Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision
Sandro Pezzelle | Claudio Greco | Greta Gandolfi | Eleonora Gualdoni | Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2020

This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models combine complementary information from the two modalities. Recently, impressive progress has been made to develop universal multimodal encoders suitable for virtually any language and vision tasks. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, best-performing models struggle to achieve similar results.

Co-authors