Surprisal reveals diversity gaps in image captioning and different scorers change the story

Nikolai Ilinykh, Simon Dobnik


Abstract
We quantify linguistic diversity in image captioning with surprisal variance – the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces a surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; robust diversity evaluation must therefore report surprisal under several scorers.
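The definition of surprisal variance above can be illustrated with a minimal sketch. It is not the authors' implementation: the add-alpha unigram scorer stands in for the caption-trained n-gram LM, and the helper names (`train_unigram`, `token_surprisals`, `surprisal_variance`), the base-2 logarithm, whitespace tokenisation, and pooling all tokens of a caption set before taking the population variance are assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import pvariance


def train_unigram(corpus, alpha=1.0):
    """Toy add-alpha unigram scorer (assumption: stands in for a real n-gram LM)."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen tokens

    def prob(token):
        return (counts[token] + alpha) / (total + alpha * vocab)

    return prob


def token_surprisals(caption, prob):
    """Surprisal of each token: negative log2-probability under the scorer."""
    return [-math.log2(prob(tok)) for tok in caption.split()]


def surprisal_variance(captions, prob):
    """Population variance of token surprisals pooled over a caption set."""
    pooled = [s for cap in captions for s in token_surprisals(cap, prob)]
    return pvariance(pooled)


# Hypothetical toy data, not taken from the paper.
scorer = train_unigram(["a dog on a couch", "a cat on a mat"])
uniform_set = ["a a a"]             # every token equally predictable
varied_set = ["a dog on a zebra"]   # mixes frequent and unseen tokens
print(surprisal_variance(uniform_set, scorer))
print(surprisal_variance(varied_set, scorer))
```

Passing a different `prob` function to `surprisal_variance` rescoring the same captions is how one would compare scorers in this sketch; the resulting variances need not agree in rank across scorers.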
Anthology ID:
2025.inlg-main.22
Volume:
Proceedings of the 18th International Natural Language Generation Conference
Month:
October
Year:
2025
Address:
Hanoi, Vietnam
Editors:
Lucie Flek, Shashi Narayan, Lê Hồng Phương, Jiahuan Pei
Venue:
INLG
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
366–375
URL:
https://preview.aclanthology.org/author-page-you-zhang-rochester/2025.inlg-main.22/
Cite (ACL):
Nikolai Ilinykh and Simon Dobnik. 2025. Surprisal reveals diversity gaps in image captioning and different scorers change the story. In Proceedings of the 18th International Natural Language Generation Conference, pages 366–375, Hanoi, Vietnam. Association for Computational Linguistics.
Cite (Informal):
Surprisal reveals diversity gaps in image captioning and different scorers change the story (Ilinykh & Dobnik, INLG 2025)
PDF:
https://preview.aclanthology.org/author-page-you-zhang-rochester/2025.inlg-main.22.pdf