Gender Encoding Patterns in Pretrained Language Model Representations

Mahdi Zakizadeh, Mohammad Taher Pilehvar

Abstract
Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.
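The abstract does not spell out the measurement procedure, so the sketch below is an illustration only, not the authors' method: it shows one common information-theoretic probing setup, in which a linear probe trained on frozen encoder representations yields a variational lower bound on the mutual information between the representations and a gender label. The model checkpoint (bert-base-uncased), the toy sentences, and the labels are all our assumptions for the example.

# Hedged sketch: probing frozen PLM representations for gender information.
# NOT the paper's exact procedure; it illustrates a generic mutual-information
# lower bound via a probe's cross-entropy, I(rep; y) >= H(y) - H(y | rep).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

MODEL = "bert-base-uncased"  # assumption: any encoder-based PLM would do
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL).eval()

# Toy probing data: sentences paired with a binary gender label (0/1).
sents = ["He is a nurse.", "She is a nurse.",
         "He is an engineer.", "She is an engineer."]
labels = np.array([0, 1, 0, 1])

with torch.no_grad():
    batch = tok(sents, padding=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    # Mean-pool token states into one vector per sentence.
    reps = ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Linear probe; its cross-entropy estimates H(y | rep) in nats.
probe = LogisticRegression(max_iter=1000).fit(reps, labels)
ce = log_loss(labels, probe.predict_proba(reps))
h_y = np.log(2.0)  # entropy of a balanced binary label, in nats
print(f"Estimated I(rep; gender) >= {h_y - ce:.3f} nats")

A real analysis would train the probe on held-out splits (the toy example fits and scores on the same four points) and could repeat the estimate per layer or per model to compare how strongly gender is encoded across architectures.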
Anthology ID:
2025.trustnlp-main.31
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galstyan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
489–500
URL:
https://preview.aclanthology.org/moar-dois/2025.trustnlp-main.31/
DOI:
10.18653/v1/2025.trustnlp-main.31
Cite (ACL):
Mahdi Zakizadeh and Mohammad Taher Pilehvar. 2025. Gender Encoding Patterns in Pretrained Language Model Representations. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 489–500, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Gender Encoding Patterns in Pretrained Language Model Representations (Zakizadeh & Pilehvar, TrustNLP 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.trustnlp-main.31.pdf