TIGEr: Text-to-Image Grounding for Image Caption Evaluation

Ming Jiang; Qiuyuan Huang; Lei Zhang; Xin Wang; Pengchuan Zhang; Zhe Gan; Jana Diesner; Jianfeng Gao

doi:10.18653/v1/D19-1220

Abstract

This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr evaluates caption quality based not only on how well a caption represents image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr is more consistent with human judgments than existing alternative metrics. We also comprehensively assess the metric’s effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.
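As an illustration of the last step, correlating metric scores with human judgments, here is a minimal sketch. The abstract does not say which correlation coefficient is used; the sketch assumes Kendall's tau-a as one common rank correlation. The function name `kendall_tau_a` and all ratings and scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1   # the two sequences disagree on this pair
        # ties (sign == 0) count as neither in tau-a
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: human ratings and metric scores for five captions.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

tau = kendall_tau_a(human_ratings, metric_scores)
print(f"Kendall tau-a = {tau:.2f}")  # prints "Kendall tau-a = 0.80"
```

A value near 1 means the metric ranks captions in nearly the same order as the human raters; a value near 0 means the two rankings are unrelated.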

Anthology ID: D19-1220
Volume: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues: EMNLP | IJCNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 2141–2152
URL: https://aclanthology.org/D19-1220
DOI: 10.18653/v1/D19-1220
Cite (ACL): Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang, Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jianfeng Gao. 2019. TIGEr: Text-to-Image Grounding for Image Caption Evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2141–2152, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): TIGEr: Text-to-Image Grounding for Image Caption Evaluation (Jiang et al., EMNLP-IJCNLP 2019)
PDF: https://preview.aclanthology.org/add_acl24_videos/D19-1220.pdf
Code: SeleenaJM/CapEval
Data: MS COCO
