Gender Biases in Automatic Evaluation Metrics for Image Captioning

Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng


Abstract
Model-based evaluation metrics (e.g., CLIPScore and GPTScore) have demonstrated decent correlations with human judgments in various language generation tasks. However, their impact on fairness remains largely unexplored. It is widely recognized that pretrained models can inadvertently encode societal biases, so employing these models for evaluation purposes may perpetuate and amplify those biases. For example, an evaluation metric may favor the caption “a woman is calculating an account book” over “a man is calculating an account book,” even if the image only shows male accountants. In this paper, we conduct a systematic study of gender biases in model-based automatic evaluation metrics for image captioning tasks. We start by curating a dataset comprising profession, activity, and object concepts associated with stereotypical gender associations. Then, we demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations, as well as the propagation of biases to generation models through reinforcement learning. Finally, we present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments. Our dataset and framework lay the foundation for understanding the potential harm of model-based evaluation metrics, and facilitate future work on developing more inclusive evaluation metrics.
Anthology ID:
2023.emnlp-main.520
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8358–8375
URL:
https://aclanthology.org/2023.emnlp-main.520
DOI:
10.18653/v1/2023.emnlp-main.520
Cite (ACL):
Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, and Nanyun Peng. 2023. Gender Biases in Automatic Evaluation Metrics for Image Captioning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8358–8375, Singapore. Association for Computational Linguistics.
Cite (Informal):
Gender Biases in Automatic Evaluation Metrics for Image Captioning (Qiu et al., EMNLP 2023)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2023.emnlp-main.520.pdf
Video:
https://preview.aclanthology.org/ingest-2024-clasp/2023.emnlp-main.520.mp4