@inproceedings{yang-etal-2025-storyllava,
    title = "{StoryLLaVA}: Enhancing Visual Storytelling with Multi-Modal Large Language Models",
    author = "Yang, Li and
      Xiao, Zhiding and
      Huang, Wenxin and
      Zhong, Xian",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Di Eugenio, Barbara and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.266/",
    pages = "3936--3951",
    abstract = "The rapid development of multimodal large language models (MLLMs) has positioned visual storytelling as a crucial area in content creation. However, existing models often struggle to maintain temporal, spatial, and narrative coherence across image sequences, and they frequently lack the depth and engagement of human-authored stories. To address these challenges, we propose Story with Large Language-and-Vision Alignment (StoryLLaVA), a novel framework for enhancing visual storytelling. Our approach introduces a topic-driven narrative optimizer that improves both the training data and MLLM models by integrating image descriptions, topic generation, and GPT-4-based refinements. Furthermore, we employ a preference-based ranked story sampling method that aligns model outputs with human storytelling preferences through positive-negative pairing. These two phases of the framework differ in their training methods: the former uses supervised fine-tuning, while the latter incorporates reinforcement learning with positive and negative sample pairs. Experimental results demonstrate that StoryLLaVA outperforms current models in visual relevance, coherence, and fluency, with LLM-based evaluations confirming the generation of richer and more engaging narratives. The enhanced dataset and model will be made publicly available soon."
}
Markdown (Informal)
[StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models](https://aclanthology.org/2025.coling-main.266/) (Yang et al., COLING 2025)
ACL