Abstract
Prior work has shown that word embeddings capture human stereotypes, including gender bias. However, there is a lack of studies testing the presence of specific gender bias categories in word embeddings across diverse domains. This paper aims to fill this gap by applying the WEAT bias detection method to four sets of word embeddings trained on corpora from four different domains: news, social networking, biomedical and a gender-balanced corpus extracted from Wikipedia (GAP). We find that some domains are definitely more prone to gender bias than others, and that the categories of gender bias present also vary for each set of word embeddings. We detect some gender bias in GAP. We also propose a simple but novel method for discovering new bias categories by clustering word embeddings. We validate this method through WEAT’s hypothesis testing mechanism and find it useful for expanding the relatively small set of well-known gender bias word categories commonly used in the literature.
- Anthology ID:
- W19-3804
- Volume:
- Proceedings of the First Workshop on Gender Bias in Natural Language Processing
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Marta R. Costa-jussà, Christian Hardmeier, Will Radford, Kellie Webster
- Venue:
- GeBNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 25–32
- URL:
- https://aclanthology.org/W19-3804
- DOI:
- 10.18653/v1/W19-3804
- Cite (ACL):
- Kaytlin Chaloner and Alfredo Maldonado. 2019. Measuring Gender Bias in Word Embeddings across Domains and Discovering New Gender Bias Word Categories. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 25–32, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Measuring Gender Bias in Word Embeddings across Domains and Discovering New Gender Bias Word Categories (Chaloner & Maldonado, GeBNLP 2019)
- PDF:
- https://preview.aclanthology.org/improve-issue-templates/W19-3804.pdf
- Code:
- alfredomg/GeBNLP2019
- Data:
- GAP Coreference Dataset
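
The abstract refers to the WEAT bias detection method without spelling it out. As a rough illustration only, the sketch below computes the commonly described WEAT effect size for two target word sets and two attribute word sets. It is not taken from the paper or from the linked repository: the function names, the toy 2-dimensional vectors, and the use of the sample standard deviation (`ddof=1`) are all assumptions made for this example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean similarity of w to attribute set A minus mean similarity to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of target sets X and Y,
    normalised by the standard deviation over all target words.
    ddof=1 (sample standard deviation) is an assumption of this sketch."""
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    pooled_std = np.std(s_x + s_y, ddof=1)
    return (np.mean(s_x) - np.mean(s_y)) / pooled_std

# Toy, hand-made vectors (hypothetical, not real embeddings).
A = [np.array([1.0, 0.0])]                          # e.g. male attribute words
B = [np.array([0.0, 1.0])]                          # e.g. female attribute words
X = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]    # targets leaning towards A
Y = [np.array([0.1, 1.0]), np.array([0.2, 0.9])]    # targets leaning towards B

effect = weat_effect_size(X, Y, A, B)
print(effect)
```

A positive value suggests that the words in `X` sit closer to `A` than the words in `Y` do. In practice the vectors would come from trained embeddings, and the effect size would normally be accompanied by a significance test such as the hypothesis testing mechanism the abstract mentions, which this sketch omits.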