Every picture tells a story: Image-grounded controllable stylistic story generation

Holy Lovenia, Bryan Wilie, Romain Barraud, Samuel Cahyawijaya, Willy Chung, Pascale Fung


Abstract
Generating a short story from an image is arduous. Unlike image captioning, story generation from an image poses multiple challenges: preserving story coherence, appropriately assessing the quality of the story, steering the generated story toward a certain style, and addressing the scarcity of image-story pair reference datasets, which limits supervision during training. In this work, we introduce Plug-and-Play Story Teller (PPST) and improve image-to-story generation by: 1) alleviating the data scarcity problem by incorporating large pre-trained models, namely CLIP and GPT-2, to facilitate fluent image-to-text generation with minimal supervision, and 2) enabling more style-relevant generation by incorporating stylistic adapters to control the story generation. We conduct image-to-story generation experiments with non-styled, romance-styled, and action-styled PPST approaches and compare our generated stories with those of previous work over three aspects, i.e., story coherence, image-story relevance, and style fitness, using both automatic and human evaluation. The results show that PPST improves story coherence and image-story relevance, but is not yet adequately stylistic.
Anthology ID:
2022.latechclfl-1.6
Volume:
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
LaTeCHCLfL
SIG:
SIGHUM
Publisher:
International Conference on Computational Linguistics
Pages:
40–52
URL:
https://aclanthology.org/2022.latechclfl-1.6
Cite (ACL):
Holy Lovenia, Bryan Wilie, Romain Barraud, Samuel Cahyawijaya, Willy Chung, and Pascale Fung. 2022. Every picture tells a story: Image-grounded controllable stylistic story generation. In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 40–52, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal):
Every picture tells a story: Image-grounded controllable stylistic story generation (Lovenia et al., LaTeCHCLfL 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.latechclfl-1.6.pdf
Data:
BookCorpus, COCO