Visual Referring Expression Recognition: What Do Systems Actually Learn?
Volkan Cirik, Louis-Philippe Morency, Taylor Berg-Kirkpatrick
Abstract
We present an empirical analysis of state-of-the-art systems for referring expression recognition – the task of identifying the object in an image referred to by a natural language expression – with the goal of gaining insight into how these systems reason about language and vision. Surprisingly, we find strong evidence that even sophisticated and linguistically-motivated models for this task may ignore linguistic structure, instead relying on shallow correlations introduced by unintended biases in the data selection and annotation process. For example, we show that a system trained and tested on the input image without the input referring expression can achieve a precision of 71.2% in top-2 predictions. Furthermore, a system that predicts only the object category given the input can achieve a precision of 84.2% in top-2 predictions. These surprisingly positive results for what should be deficient prediction scenarios suggest that careful analysis of what our models are learning – and further, how our data is constructed – is critical as we seek to make substantive progress on grounded language tasks.
- Anthology ID:
- N18-2123
- Volume:
- Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
- Month:
- June
- Year:
- 2018
- Address:
- New Orleans, Louisiana
- Editors:
- Marilyn Walker, Heng Ji, Amanda Stent
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 781–787
- URL:
- https://aclanthology.org/N18-2123
- DOI:
- 10.18653/v1/N18-2123
- Cite (ACL):
- Volkan Cirik, Louis-Philippe Morency, and Taylor Berg-Kirkpatrick. 2018. Visual Referring Expression Recognition: What Do Systems Actually Learn?. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 781–787, New Orleans, Louisiana. Association for Computational Linguistics.
- Cite (Informal):
- Visual Referring Expression Recognition: What Do Systems Actually Learn? (Cirik et al., NAACL 2018)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/N18-2123.pdf
- Code
- volkancirik/neural-sieves-refexp
- Data
- MS COCO