Visual Referring Expression Recognition: What Do Systems Actually Learn?

Volkan Cirik; Louis-Philippe Morency; Taylor Berg-Kirkpatrick

doi:10.18653/v1/N18-2123

Abstract

We present an empirical analysis of state-of-the-art systems for referring expression recognition – the task of identifying the object in an image referred to by a natural language expression – with the goal of gaining insight into how these systems reason about language and vision. Surprisingly, we find strong evidence that even sophisticated and linguistically-motivated models for this task may ignore linguistic structure, instead relying on shallow correlations introduced by unintended biases in the data selection and annotation process. For example, we show that a system trained and tested on the input image without the input referring expression can achieve a precision of 71.2% in top-2 predictions. Furthermore, a system that predicts only the object category given the input can achieve a precision of 84.2% in top-2 predictions. These surprisingly positive results for what should be deficient prediction scenarios suggest that careful analysis of what our models are learning – and further, how our data is constructed – is critical as we seek to make substantive progress on grounded language tasks.

Anthology ID: N18-2123
Volume: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
Month: June
Year: 2018
Address: New Orleans, Louisiana
Editors: Marilyn Walker, Heng Ji, Amanda Stent
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 781–787
URL: https://preview.aclanthology.org/nschneid-patch-1/N18-2123/
DOI: 10.18653/v1/N18-2123
Cite (ACL): Volkan Cirik, Louis-Philippe Morency, and Taylor Berg-Kirkpatrick. 2018. Visual Referring Expression Recognition: What Do Systems Actually Learn? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 781–787, New Orleans, Louisiana. Association for Computational Linguistics.
Cite (Informal): Visual Referring Expression Recognition: What Do Systems Actually Learn? (Cirik et al., NAACL 2018)
PDF: https://preview.aclanthology.org/nschneid-patch-1/N18-2123.pdf
