Abstract
Recent advances in image and video creation, especially AI-based image synthesis, have led to the production of numerous visual scenes that exhibit a high level of abstractness and diversity. Consequently, Visual Storytelling (VST), a task that involves generating meaningful and coherent narratives from a collection of images, has become even more challenging and is increasingly desired beyond real-world imagery. While existing VST techniques, which typically use autoregressive decoders, have made significant progress, they suffer from low inference speed and are not well-suited for synthetic scenes. To this end, we propose a novel diffusion-based system DiffuVST, which models the generation of a series of visual descriptions as a single conditional denoising process. The stochastic and non-autoregressive nature of DiffuVST at inference time allows it to generate highly diverse narratives more efficiently. In addition, DiffuVST features a unique design with bi-directional text history guidance and multimodal adapter modules, which effectively improve inter-sentence coherence and image-to-text fidelity. Extensive experiments on the story generation task covering four fictional visual-story datasets demonstrate the superiority of DiffuVST over traditional autoregressive models in terms of both text quality and inference speed.

- Anthology ID: 2023.findings-emnlp.126
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1885–1896
- URL: https://aclanthology.org/2023.findings-emnlp.126
- DOI: 10.18653/v1/2023.findings-emnlp.126
- Cite (ACL): Shengguang Wu, Mei Yuan, and Qi Su. 2023. DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1885–1896, Singapore. Association for Computational Linguistics.
- Cite (Informal): DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models (Wu et al., Findings 2023)
- PDF: https://preview.aclanthology.org/emnlp-22-attachments/2023.findings-emnlp.126.pdf