What Makes for Good Image Captions?

Delong Chen; Samuel Cahyawijaya; Etsuko Ishii; Ho Shu Chan; Yejin Bang; Pascale Fung

doi:10.18653/v1/2025.findings-emnlp.75

Abstract

This paper establishes a formal information-theoretic framework for image captioning, conceptualizing captions as compressed linguistic representations that selectively encode semantic units in images. Our framework posits that good image captions should balance three key aspects: informational sufficiency, minimal redundancy, and ready comprehensibility by humans. By formulating these aspects as quantitative measures with adjustable weights, our framework provides a flexible foundation for analyzing and optimizing image captioning systems across diverse task requirements. To demonstrate its applicability, we introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information. We present both a theoretical proof that PoCa improves caption quality under certain assumptions and empirical validation of its effectiveness across various image captioning models and datasets.

Anthology ID: 2025.findings-emnlp.75
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1420–1437
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.75/
DOI: 10.18653/v1/2025.findings-emnlp.75
Cite (ACL): Delong Chen, Samuel Cahyawijaya, Etsuko Ishii, Ho Shu Chan, Yejin Bang, and Pascale Fung. 2025. What Makes for Good Image Captions? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1420–1437, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): What Makes for Good Image Captions? (Chen et al., Findings 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.75.pdf
Checklist: 2025.findings-emnlp.75.checklist.pdf
