Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?

Gonçalo Emanuel Cavaco Gomes, Chrysoula Zerva, Bruno Martins


Abstract
The evaluation of image captions, considering both linguistic fluency and semantic correspondence to visual contents, has received significant research attention. Still, despite advances such as the CLIPScore metric, multilingual captioning evaluation remains relatively unexplored. This work presents several strategies, and extensive experiments, for evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the quality of their assessments.
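For context, CLIPScore (Hessel et al., 2021) is a reference-free metric that scores a caption by the cosine similarity between the CLIP image and text embeddings, rescaled as w * max(cos(v, c), 0) with w = 2.5. The minimal sketch below shows how such a score can be computed with publicly available multilingual CLIP checkpoints from sentence-transformers; the checkpoint names and the example image path are illustrative assumptions, and this is not the finetuned setup evaluated in the paper.

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Public checkpoints used only for illustration: a CLIP ViT-B/32 image encoder
# and a multilingual text encoder aligned to the same embedding space.
image_model = SentenceTransformer("clip-ViT-B-32")
text_model = SentenceTransformer("clip-ViT-B-32-multilingual-v1")

def clipscore(image_path: str, caption: str, w: float = 2.5) -> float:
    """Reference-free CLIPScore: w * max(cos(image_emb, caption_emb), 0)."""
    v = image_model.encode(Image.open(image_path), normalize_embeddings=True)
    c = text_model.encode(caption, normalize_embeddings=True)
    # Embeddings are L2-normalized, so the dot product is the cosine similarity.
    return float(w * max(np.dot(v, c), 0.0))

# Score the same (hypothetical) image against captions in different languages.
print(clipscore("beach.jpg", "A dog running on the beach"))
print(clipscore("beach.jpg", "Um cão a correr na praia"))
```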
Anthology ID:
2025.findings-naacl.287
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5156–5175
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.287/
Cite (ACL):
Gonçalo Emanuel Cavaco Gomes, Chrysoula Zerva, and Bruno Martins. 2025. Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 5156–5175, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? (Gomes et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.287.pdf