The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description

Nikolai Ilinykh, Sina Zarrieß, David Schlangen


Abstract
Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems for transferring to task settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bound to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the nature of image descriptions found in MeetUp! is diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.
Anthology ID:
W18-6547
Volume:
Proceedings of the 11th International Conference on Natural Language Generation
Month:
November
Year:
2018
Address:
Tilburg University, The Netherlands
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
397–402
URL:
https://aclanthology.org/W18-6547
DOI:
10.18653/v1/W18-6547
Cite (ACL):
Nikolai Ilinykh, Sina Zarrieß, and David Schlangen. 2018. The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description. In Proceedings of the 11th International Conference on Natural Language Generation, pages 397–402, Tilburg University, The Netherlands. Association for Computational Linguistics.
Cite (Informal):
The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description (Ilinykh et al., INLG 2018)
PDF:
https://preview.aclanthology.org/auto-file-uploads/W18-6547.pdf