What goes into a word: generating image descriptions with top-down spatial knowledge

Mehdi Ghanimifard; Simon Dobnik

doi:10.18653/v1/W19-8668

Abstract

Generating grounded image descriptions requires associating linguistic units with their corresponding visual clues. A common method is to train a decoder language model with an attention mechanism over convolutional visual features. Attention weights align the stratified visual features, arranged by their location, with tokens (most commonly words) in the target description. However, words such as spatial relations (e.g. next to and under) do not refer directly to geometric arrangements of pixels but to complex geometric and conceptual representations. The aim of this paper is to evaluate which representations facilitate generating image descriptions with spatial relations and lead to better grounded language generation. In particular, we investigate the contribution of three different representational modalities in generating relational referring expressions: (i) pre-trained convolutional visual features, (ii) different top-down geometric relational knowledge between objects, and (iii) world knowledge captured by contextual embeddings in language models.

Anthology ID: W19-8668
Volume: Proceedings of the 12th International Conference on Natural Language Generation
Month: October–November
Year: 2019
Address: Tokyo, Japan
Editors: Kees van Deemter, Chenghua Lin, Hiroya Takamura
Venue: INLG
SIG: SIGGEN
Publisher: Association for Computational Linguistics
Pages: 540–551
URL: https://preview.aclanthology.org/nschneid-patch-2/W19-8668/
DOI: 10.18653/v1/W19-8668
Cite (ACL): Mehdi Ghanimifard and Simon Dobnik. 2019. What goes into a word: generating image descriptions with top-down spatial knowledge. In Proceedings of the 12th International Conference on Natural Language Generation, pages 540–551, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal): What goes into a word: generating image descriptions with top-down spatial knowledge (Ghanimifard & Dobnik, INLG 2019)
PDF: https://preview.aclanthology.org/nschneid-patch-2/W19-8668.pdf
Supplementary attachment: W19-8668.Supplementary_Attachment.pdf
