Abstract
We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn ‘distributional similarity’ in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the ‘image’ side of image captioning, and vary the input image representation but keep the RNN text generation model of a CNN-RNN constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) experience virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our experiments all point to one fact: that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.
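The experimental setup the abstract describes can be pictured with a minimal sketch (not the authors' code): precomputed CNN image features are compressed to a lower-dimensional space, while the RNN caption decoder that consumes them is kept unchanged. The decoder class, its methods, and the feature dimensions below are hypothetical placeholders.

```python
# Minimal sketch, assuming precomputed CNN features and a fixed caption decoder;
# not the authors' implementation (see sheffieldnlp/whatIC for their code).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-ins for precomputed CNN image features (e.g. 4096-d fc7 vectors).
train_feats = rng.normal(size=(1000, 4096))
test_feats = rng.normal(size=(10, 4096))

# Compress the high-dimensional representation to a lower-dimensional space.
pca = PCA(n_components=256).fit(train_feats)
train_low = pca.transform(train_feats)
test_low = pca.transform(test_feats)

# The RNN language model conditioned on the image vector stays the same;
# only the representation it is conditioned on is swapped out.
# decoder = CaptionDecoder(input_dim=256)             # hypothetical
# captions = [decoder.generate(v) for v in test_low]  # hypothetical
```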
- Anthology ID: W18-5455
- Volume: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
- Month: November
- Year: 2018
- Address: Brussels, Belgium
- Editors: Tal Linzen, Grzegorz Chrupała, Afra Alishahi
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 381–383
- URL: https://aclanthology.org/W18-5455
- DOI: 10.18653/v1/W18-5455
- Cite (ACL): Pranava Swaroop Madhyastha, Josiah Wang, and Lucia Specia. 2018. End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 381–383, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal): End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space (Madhyastha et al., EMNLP 2018)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/W18-5455.pdf
- Code: sheffieldnlp/whatIC