VISIAR: Empower MLLM for Visual Story Ideation

Zhaoyang Xia; Somdeb Sarkhel; Mehrab Tanjim; Stefano Petrangeli; Ishita Dasgupta; Yuxiao Chen; Jinxuan Xu; Di Liu; Saayan Mitra; Dimitris N. Metaxas

doi:10.18653/v1/2025.findings-acl.945

Abstract

Ideation, the process of forming ideas from concepts, is a big part of the content creation process. However, the noble goal of helping visual content creators by suggesting meaningful sequences of visual assets from a limited collection is challenging. It requires a nuanced understanding of visual assets and the integration of open-world knowledge to support creative exploration. Despite its importance, this task has yet to be explored fully in existing literature. To fill this gap, we propose Visual Story Ideation, a novel and underexplored task focused on the automated selection and arrangement of visual assets into coherent sequences that convey expressive storylines. We also present VISIAR, Visual Ideation through Sequence Integration and Asset Rearrangement, a robust framework leveraging Multimodal Large Language Models (MLLMs) and a novel Story Graph mechanism. Our framework operates in three key stages: visual content understanding, candidate asset selection, and asset rearrangement via MLLMs. In addition, we curated a new benchmark dataset, called VTravel, to evaluate our methods both qualitatively and quantitatively. User studies and GPT-as-the-judge evaluation show that our approach surpasses a GPT-4o-based baseline by an average of 33.5% and 18.5% across three different metrics, demonstrating the effectiveness of our framework for generating compelling visual stories.

Anthology ID: 2025.findings-acl.945
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 18384–18402
URL: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.945/
DOI: 10.18653/v1/2025.findings-acl.945
Cite (ACL): Zhaoyang Xia, Somdeb Sarkhel, Mehrab Tanjim, Stefano Petrangeli, Ishita Dasgupta, Yuxiao Chen, Jinxuan Xu, Di Liu, Saayan Mitra, and Dimitris N. Metaxas. 2025. VISIAR: Empower MLLM for Visual Story Ideation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18384–18402, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): VISIAR: Empower MLLM for Visual Story Ideation (Xia et al., Findings 2025)
PDF: https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.945.pdf
