Generating Contextual Images for Long-Form Text
Avijit Mitra, Nalin Gupta, Chetan Naik, Abhinav Sethy, Kinsey Bice, Zeynab Raeesy
Abstract
We investigate the problem of synthesizing relevant visual imagery from generic long-form text, leveraging Large Language Models (LLMs) and Text-to-Image Models (TIMs). Current Text-to-Image models require short prompts that describe the image content and style explicitly. In contrast, generating images from general long-form text requires the image synthesis system to derive the visual content and style elements from the text itself. In this paper, we study zero-shot prompting and supervised fine-tuning approaches that use LLMs and TIMs jointly for synthesizing images. We present an empirical study on generating images for Wikipedia articles covering a broad spectrum of topics and image styles. We compare these systems using a suite of metrics, including a novel metric specifically designed to evaluate the semantic correctness of generated images. Our study offers a preliminary understanding of existing models’ strengths and limitations for the task of image generation from long-form text, sets up an evaluation framework, and establishes baselines for future research.
- Anthology ID:
- 2024.lrec-main.673
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 7623–7633
- URL:
- https://aclanthology.org/2024.lrec-main.673
- Cite (ACL):
- Avijit Mitra, Nalin Gupta, Chetan Naik, Abhinav Sethy, Kinsey Bice, and Zeynab Raeesy. 2024. Generating Contextual Images for Long-Form Text. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7623–7633, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Generating Contextual Images for Long-Form Text (Mitra et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.lrec-main.673.pdf