@inproceedings{kapur-kreiss-2024-reference,
    title = "Reference-Based Metrics Are Biased Against Blind and Low-Vision Users' Image Description Preferences",
    author = "Kapur, Rhea  and
      Kreiss, Elisa",
    editor = "Dementieva, Daryna  and
      Ignat, Oana  and
      Jin, Zhijing  and
      Mihalcea, Rada  and
      Piatti, Giorgio  and
      Tetreault, Joel  and
      Wilson, Steven  and
      Zhao, Jieyu",
    booktitle = "Proceedings of the Third Workshop on NLP for Positive Impact",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.nlp4pi-1.26/",
    doi = "10.18653/v1/2024.nlp4pi-1.26",
    pages = "308--314",
    abstract = "Image description generation models are sophisticated Vision-Language Models which promise to make visual content, such as images, non-visually accessible through linguistic descriptions. While these systems can benefit all, their primary motivation tends to lie in allowing blind and low-vision (BLV) users access to increasingly visual (online) discourse. Well-defined evaluation methods are crucial for steering model development into socially useful directions. In this work, we show that the most popular evaluation metrics (reference-based metrics) are biased against BLV users and therefore potentially stifle useful model development. Reference-based metrics assign quality scores based on the similarity to human-generated ground-truth descriptions and are widely accepted as neutrally representing the needs of all users. However, we find that these metrics are more strongly correlated with sighted participant ratings than BLV ratings, and we explore factors which appear to mediate this finding: description length, the image{'}s context of appearance, and the number of reference descriptions available. These findings suggest that there is a need for developing evaluation methods that are established based on specific downstream user groups, and they highlight the importance of reflecting on emerging biases against minorities in the development of general-purpose automatic metrics."
}