GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction

Virginia Felkner; Jennifer Thompson; Jonathan May

doi:10.18653/v1/2024.acl-long.760

Abstract

Social biases in LLMs are usually measured via bias benchmark datasets. Current benchmarks have limitations in scope, grounding, quality, and human effort required. Previous work has shown success with a community-sourced, rather than crowd-sourced, approach to benchmark development. However, this work still required considerable effort from annotators with relevant lived experience. This paper explores whether an LLM (specifically, GPT-3.5-Turbo) can assist with the task of developing a bias benchmark dataset from responses to an open-ended community survey. We also extend the previous work to a new community and set of biases: the Jewish community and antisemitism. Our analysis shows that GPT-3.5-Turbo has poor performance on this annotation task and produces unacceptable quality issues in its output. Thus, we conclude that GPT-3.5-Turbo is not an appropriate substitute for human annotation in sensitive tasks related to social biases, and that its use actually negates many of the benefits of community-sourcing bias benchmarks.

Anthology ID: 2024.acl-long.760
Volume: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 14104–14115
URL: https://aclanthology.org/2024.acl-long.760
DOI: 10.18653/v1/2024.acl-long.760
Cite (ACL): Virginia Felkner, Jennifer Thompson, and Jonathan May. 2024. GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14104–14115, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction (Felkner et al., ACL 2024)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.acl-long.760.pdf
