Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets

Mahdi Zakizadeh; Mohammad Taher Pilehvar

Abstract

Accurately measuring gender stereotypical bias in language models is a complex task with many hidden aspects. Current benchmarks have underestimated this multifaceted challenge and failed to capture the full extent of the problem. This paper examines the inconsistencies between intrinsic stereotype benchmarks. We propose that currently available benchmarks each capture only partial facets of gender stereotypes, and when considered in isolation, they provide just a fragmented view of the broader landscape of bias in language models. Using StereoSet and CrowS-Pairs as case studies, we investigated how data distribution affects benchmark results. By applying a framework from social psychology to balance the data of these benchmarks across various components of gender stereotypes, we demonstrated that even simple balancing techniques can significantly improve the correlation between different measurement approaches. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.

Anthology ID: 2025.emnlp-main.1162
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 22838–22851
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1162/
Cite (ACL): Mahdi Zakizadeh and Mohammad Taher Pilehvar. 2025. Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22838–22851, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets (Zakizadeh & Pilehvar, EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1162.pdf
Checklist: 2025.emnlp-main.1162.checklist.pdf
