Every picture tells a story: Image-grounded controllable stylistic story generation
Holy Lovenia, Bryan Wilie, Romain Barraud, Samuel Cahyawijaya, Willy Chung, Pascale Fung
Abstract
Generating a short story out of an image is arduous. Unlike image captioning, story generation from an image poses multiple challenges: preserving the story coherence, appropriately assessing the quality of the story, steering the generated story into a certain style, and addressing the scarcity of image-story pair reference datasets limiting supervision during training. In this work, we introduce Plug-and-Play Story Teller (PPST) and improve image-to-story generation by: 1) alleviating the data scarcity problem by incorporating large pre-trained models, namely CLIP and GPT-2, to facilitate a fluent image-to-text generation with minimal supervision, and 2) enabling a more style-relevant generation by incorporating stylistic adapters to control the story generation. We conduct image-to-story generation experiments with non-styled, romance-styled, and action-styled PPST approaches and compare our generated stories with those of previous work over three aspects, i.e., story coherence, image-story relevance, and style fitness, using both automatic and human evaluation. The results show that PPST improves story coherence and has better image-story relevance, but has yet to be adequately stylistic.
- Anthology ID: 2022.latechclfl-1.6
- Volume: Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
- Month: October
- Year: 2022
- Address: Gyeongju, Republic of Korea
- Editors: Stefania Degaetano, Anna Kazantseva, Nils Reiter, Stan Szpakowicz
- Venue: LaTeCHCLfL
- SIG: SIGHUM
- Publisher: International Conference on Computational Linguistics
- Pages: 40–52
- URL: https://aclanthology.org/2022.latechclfl-1.6
- Cite (ACL): Holy Lovenia, Bryan Wilie, Romain Barraud, Samuel Cahyawijaya, Willy Chung, and Pascale Fung. 2022. Every picture tells a story: Image-grounded controllable stylistic story generation. In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 40–52, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
- Cite (Informal): Every picture tells a story: Image-grounded controllable stylistic story generation (Lovenia et al., LaTeCHCLfL 2022)
- PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/2022.latechclfl-1.6.pdf
- Data: BookCorpus, MS COCO