@inproceedings{madhyastha-etal-2018-end,
    title = "End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space",
    author = "Madhyastha, Pranava Swaroop  and
      Wang, Josiah  and
      Specia, Lucia",
    editor = "Linzen, Tal  and
      Chrupa{\l}a, Grzegorz  and
      Alishahi, Afra",
    booktitle = "Proceedings of the 2018 {EMNLP} Workshop {B}lackbox{NLP}: Analyzing and Interpreting Neural Networks for {NLP}",
    month = nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/iwcs-25-ingestion/W18-5455/",
    doi = "10.18653/v1/W18-5455",
    pages = "381--383",
    abstract = "We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn `distributional similarity' in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the `image' side of image captioning, and vary the input image representation but keep the RNN text generation model of a CNN-RNN constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) experience virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our experiments all point to one fact: that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace."
}