@inproceedings{cafagna-etal-2023-hl,
title = "{HL} Dataset: Visually-grounded Description of Scenes, Actions and Rationales",
author = "Cafagna, Michele and
van Deemter, Kees and
Gatt, Albert",
editor = "Keet, C. Maria and
Lee, Hung-Yi and
Zarrie{\ss}, Sina",
booktitle = "Proceedings of the 16th International Natural Language Generation Conference",
month = sep,
year = "2023",
address = "Prague, Czechia",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.inlg-main.21/",
doi = "10.18653/v1/2023.inlg-main.21",
pages = "293--312",
abstract = "Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. {\textquotedblleft}people eating food in a park{\textquotedblright}. Although these datasets are useful to evaluate the ability of Vision {\&} Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ({\textquotedblleft}people at a holiday resort{\textquotedblright}) and the actions they perform ({\textquotedblleft}people having a picnic{\textquotedblright}). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task."
}
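
Based only on the dataset description in the abstract above, here is a minimal, hypothetical Python sketch of what a single HL record might look like, and of combining the three caption axes (scene, action, rationale) into a narrative caption. All field names, the confidence scale, and the combination template are illustrative assumptions, not the authors' actual schema or generation method.

# Hypothetical sketch of one HL Dataset record, inferred solely from the
# abstract; field names and values are assumptions for illustration.

hl_record = {
    "coco_image_id": 42,  # placeholder, not a real COCO identifier
    "captions": {
        "scene": "people at a holiday resort",       # example from the abstract
        "action": "people having a picnic",          # example from the abstract
        "rationale": "they want to relax together",  # invented for illustration
    },
    # Confidence scores from an independent set of readers (scale assumed).
    "confidence": {"scene": 4, "action": 5, "rationale": 3},
}

def to_narrative(captions: dict) -> str:
    """Combine the three axes into one narrative caption.

    The paper generates narrative captions synthetically by combining the
    three axes; the exact procedure is not given in the abstract, so this
    simple join is a toy stand-in.
    """
    return "; ".join(
        [captions["scene"], captions["action"], captions["rationale"]]
    )

print(to_narrative(hl_record["captions"]))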