Image Embedding Sampling Method for Diverse Captioning

Sania Waheed, Na Min An


Abstract
Image captioning with state-of-the-art VLMs has improved significantly over time; however, this comes at the cost of increased computational complexity, making these models less accessible for resource-constrained applications such as mobile devices and assistive technologies. In contrast, comparably smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions, using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on the MSCOCO, Flickr30k, and Nocaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions. Our code is available at https://github.com/xfactlab/HBoP.
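The sketch below illustrates the general idea described in the abstract: caption the full image for a global description, then caption individual regions to surface localized details, yielding a more diverse caption set without any extra training. It is a minimal approximation, not the paper's method: the region proposals here come from a hypothetical grid crop, whereas the paper's framework (HBoP) samples over BLIP image embeddings via structured segmentation; the model names and helper functions are assumptions for illustration.

```python
# Minimal sketch: global + region-level captioning with a small VLM (BLIP).
# Assumes the HuggingFace `transformers` BLIP captioning checkpoint; the grid-crop
# region selection is a hypothetical stand-in for the paper's embedding-level segmentation.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(img: Image.Image) -> str:
    """Caption a single image (or image region) with BLIP."""
    inputs = processor(images=img, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def grid_regions(img: Image.Image, rows: int = 2, cols: int = 2):
    """Hypothetical region proposal: split the image into a simple grid of crops."""
    w, h = img.size
    for r in range(rows):
        for c in range(cols):
            yield img.crop((c * w // cols, r * h // rows,
                            (c + 1) * w // cols, (r + 1) * h // rows))

def diverse_captions(path: str) -> list[str]:
    """One global caption plus one caption per region, deduplicated."""
    img = Image.open(path).convert("RGB")
    caps = [caption(img)]                                       # global semantics
    caps += [caption(region) for region in grid_regions(img)]   # localized details
    return list(dict.fromkeys(caps))

if __name__ == "__main__":
    for c in diverse_captions("example.jpg"):
        print(c)
```

In this sketch, diversity comes purely from varying the visual input to a fixed captioner; the paper instead operates on hierarchical image-embedding samples, so the crops above should be read only as a stand-in for attending to distinct image regions.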
Anthology ID:
2025.emnlp-main.156
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3141–3157
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.156/
Cite (ACL):
Sania Waheed and Na Min An. 2025. Image Embedding Sampling Method for Diverse Captioning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3141–3157, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Image Embedding Sampling Method for Diverse Captioning (Waheed & An, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.156.pdf
Checklist:
 2025.emnlp-main.156.checklist.pdf