On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasovic
Abstract
Combining the visual modality with pretrained language models has been surprisingly effective for simple descriptive tasks such as image captioning. More general text generation, however, remains elusive. We take a step back and ask: How do these models work for more complex generative tasks, i.e., conditioning on both text and images? Are multimodal models simply visually adapted language models, or do they reason jointly over modalities? We investigate these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) of three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in E-SNLI-VE. We show that recent unimodal advances, CLIP image representations and scaling of language models, do not consistently improve self-rationalization in multimodal tasks. We find that no single model type works universally best across tasks, datasets, and finetuning data sizes. Our findings motivate the need for novel general backbones that move text generation from images and text beyond image captioning.
- Anthology ID:
- 2022.findings-emnlp.194
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2022
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Editors:
- Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2644–2657
- URL:
- https://aclanthology.org/2022.findings-emnlp.194
- DOI:
- 10.18653/v1/2022.findings-emnlp.194
- Cite (ACL):
- Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, and Ana Marasovic. 2022. On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2644–2657, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization (Palaskar et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2022.findings-emnlp.194.pdf