SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva


Abstract
As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs; today, despite advances in hardware, they remain unpopular due to their low correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image-captioning model development. Our code can be found at https://github.com/mbzuai-nlp/SPECS.
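For context, RS metrics of the CLIPScore family score a caption directly against the image embedding, with no reference captions. The sketch below illustrates that generic reference-free computation (Hessel et al., 2021) with an off-the-shelf CLIP checkpoint and the original weighting w = 2.5; this is an assumed baseline formulation, not the SPECS objective itself, which fine-tunes CLIP to reward correct details and penalize incorrect ones.

# Minimal sketch of a reference-free CLIPScore-style RS metric.
# Assumption: generic CLIP weights; SPECS would substitute its own
# specificity-fine-tuned checkpoint for MODEL_NAME.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # placeholder checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clipscore(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free score: w * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(
        text=[caption],
        images=image,
        return_tensors="pt",
        padding=True,
        truncation=True,  # CLIP's text encoder caps input at 77 tokens,
                          # one reason vanilla CLIPScore struggles with long captions
    )
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return w * max(cos, 0.0)

# Example usage:
# score = clipscore(Image.open("photo.jpg"), "A long, detailed caption of the scene ...")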
Anthology ID:
2025.emnlp-main.477
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9406–9418
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.477/
Cite (ACL):
Xiaofu Chen, Israfel Salazar, and Yova Kementchedjhieva. 2025. SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9406–9418, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation (Chen et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.477.pdf
Checklist:
2025.emnlp-main.477.checklist.pdf