MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

Hyeyeon Kim; Sungwoo Han; Jingun Kwon; Hidetaka Kamigaito; Manabu Okumura

Abstract

In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents with summaries, multiple images, and captions, and then exclude factually inconsistent instances. Our approach selects one image from multiple images accompanying each document. Using the gold summary, we independently rank both the images and their captions. Then, we annotate a pseudo-label for an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within texts. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text- and image-only pseudo-labeling methods, which consider captions and images separately.
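The agreement rule described in the abstract (label an image only when both it and its caption rank first against the gold summary) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_pseudo_label` and its two score lists are hypothetical, and the scores are assumed to come from some image-to-summary and caption-to-summary similarity model that the abstract does not specify.

```python
from typing import Optional, Sequence


def select_pseudo_label(
    image_scores: Sequence[float],
    caption_scores: Sequence[float],
) -> Optional[int]:
    """Return the index of the image to pseudo-label, or None.

    image_scores[i]   : assumed similarity of image i to the gold summary
    caption_scores[i] : assumed similarity of caption i to the gold summary
    Both lists are hypothetical inputs; how they are computed is not
    stated in the abstract.
    """
    if len(image_scores) != len(caption_scores):
        raise ValueError("each image needs exactly one caption score")
    if not image_scores:
        return None

    # Rank images and captions independently; only the top rank matters here.
    # On ties, max() keeps the earliest index (an arbitrary choice in this sketch).
    top_image = max(range(len(image_scores)), key=image_scores.__getitem__)
    top_caption = max(range(len(caption_scores)), key=caption_scores.__getitem__)

    # Assign a pseudo-label only when the same item is first in both rankings.
    return top_image if top_image == top_caption else None
```

Under these assumptions, a document whose best-scoring image and best-scoring caption belong to different items yields `None`, so it receives no pseudo-label.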

Anthology ID: 2026.lrec-main.719
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 9150–9161
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.719/
Cite (ACL): Hyeyeon Kim, Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, and Manabu Okumura. 2026. MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling. International Conference on Language Resources and Evaluation, main:9150–9161.
Cite (Informal): MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling (Kim et al., LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.719.pdf
