MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection

Abid Ali, Diego Molla, Usman Naseem


Abstract
The abundance of multimodal news in digital form has intensified demand for systems that condense articles and images into concise, faithful digests. Yet most approaches simply perform unimodal text summarization and attach the most similar images to the text summary, which leads to redundancy both in the processing of visual content and in the selection of images to complement the summary. We propose MULSUM, a two-step framework: (i) a Cross-Vis Aligner that projects image-level embeddings into a shared space and conditions a pre-trained LLM decoder to generate a visually informed text summary, and (ii) a Diversity-Aware Image Selector that, after the summary is produced, maximizes image relevance to the summary while enforcing pairwise image diversity, yielding a compact, complementary image set. Experimental results on the benchmark MSMO (Multimodal Summarization with Multimodal Output) corpus show that MULSUM consistently outperforms strong baselines on automatic metrics such as ROUGE, while qualitative inspection shows that the selected images act as explanatory evidence rather than ornamental add-ons. Human evaluation shows that our diverse set of selected images was 13% more helpful than purely similarity-based image selection.
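
The abstract does not spell out how the Diversity-Aware Image Selector trades relevance against diversity. As a rough illustration only, the sketch below uses a greedy, MMR-style rule (a swapped-in, well-known technique, not necessarily the authors' method): each step picks the image whose similarity to the summary is highest after subtracting a penalty for its similarity to images already chosen. The names `cosine`, `select_images`, `trade_off`, and the toy 2-D vectors are all hypothetical; a real system would use learned image and summary embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_images(
    summary_vec: Sequence[float],
    image_vecs: Sequence[Sequence[float]],
    k: int = 3,
    trade_off: float = 0.7,
) -> List[int]:
    """Greedy MMR-style selection (illustrative assumption, not the paper's rule).

    trade_off = 1.0 reduces to plain similarity ranking; lower values
    penalize images that resemble ones already selected.
    """
    relevance = [cosine(summary_vec, v) for v in image_vecs]
    candidates = list(range(len(image_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], float("-inf")
        for i in candidates:
            # Redundancy = highest similarity to any already-selected image.
            redundancy = max(
                (cosine(image_vecs[i], image_vecs[j]) for j in selected),
                default=0.0,
            )
            score = trade_off * relevance[i] - (1.0 - trade_off) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data: images 0 and 1 are near-duplicates, image 2 shows something else.
summary = [1.0, 0.2]
images = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(select_images(summary, images, k=2, trade_off=1.0))  # similarity only
print(select_images(summary, images, k=2, trade_off=0.5))  # diversity-aware
```

On this toy input, similarity-only ranking returns the two near-duplicate images, whereas the diversity-aware setting keeps the most relevant image and then prefers the visually distinct one.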
Anthology ID:
2026.eacl-long.16
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
351–362
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.16/
Cite (ACL):
Abid Ali, Diego Molla, and Usman Naseem. 2026. MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–362, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
MULSUM: A Multimodal Summarization System with Vis-Aligner and Diversity-Aware Image Selection (Ali et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.16.pdf