On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasovic


Abstract
Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e., conditioning on both text and images? Are multimodal models simply visually adapted language models, or do they reason jointly over modalities? We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) of three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in E-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of language models, do not consistently improve self-rationalization in multimodal tasks. We find that no single model type works universally best across tasks, datasets, and finetuning data sizes. Our findings motivate the need for novel general backbones that move text generation from images and text beyond image captioning.
Anthology ID:
2022.findings-emnlp.194
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2644–2657
URL:
https://aclanthology.org/2022.findings-emnlp.194
DOI:
10.18653/v1/2022.findings-emnlp.194
Cite (ACL):
Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, and Ana Marasovic. 2022. On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2644–2657, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization (Palaskar et al., Findings 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.findings-emnlp.194.pdf
Software:
 2022.findings-emnlp.194.software.zip