Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision

Sandro Pezzelle, Claudio Greco, Greta Gandolfi, Eleonora Gualdoni, Raffaella Bernardi


Abstract
This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made in developing universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against that of human speakers. We show that, while the task is relatively easy for humans, best-performing models struggle to achieve similar results.
Anthology ID:
2020.findings-emnlp.248
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2751–2767
URL:
https://aclanthology.org/2020.findings-emnlp.248
DOI:
10.18653/v1/2020.findings-emnlp.248
Cite (ACL):
Sandro Pezzelle, Claudio Greco, Greta Gandolfi, Eleonora Gualdoni, and Raffaella Bernardi. 2020. Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2751–2767, Online. Association for Computational Linguistics.
Cite (Informal):
Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision (Pezzelle et al., Findings 2020)
PDF:
https://preview.aclanthology.org/naacl24-info/2020.findings-emnlp.248.pdf
Data
MS COCO, SWAG, Visual Question Answering