@inproceedings{prakash-lee-2023-layered,
title = "Layered Bias: Interpreting Bias in Pretrained Large Language Models",
author = "Prakash, Nirmalendu and
Lee, Roy Ka-Wei",
editor = "Belinkov, Yonatan and
Hao, Sophie and
Jumelet, Jaap and
Kim, Najoung and
McCarthy, Arya and
Mohebbi, Hosein",
booktitle = "Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/fix-sig-urls/2023.blackboxnlp-1.22/",
doi = "10.18653/v1/2023.blackboxnlp-1.22",
pages = "284--295",
abstract = "Large language models (LLMs) like GPT and PALM have excelled in numerous natural language processing (NLP) tasks such as text generation, question answering, and translation. However, they are also found to have inherent social biases. To address this, recent studies have proposed debiasing techniques like iterative nullspace projection (INLP) and Counterfactual Data Augmentation (CDA). Additionally, there{'}s growing interest in understanding the intricacies of these models. Some researchers focus on individual neural units, while others examine specific layers. In our study, we benchmark newly released models, assess the impact of debiasing methods, and investigate how biases are linked to different transformer layers using a method called Logit Lens. Specifically, we evaluate three modern LLMs: OPT, LLaMA, and LLaMA2, and their debiased versions. Our experiments are based on two popular bias evaluation datasets, StereoSet and CrowS-Pairs, and we perform a layer-by-layer analysis using the Logit Lens."
}
Markdown (Informal)
[Layered Bias: Interpreting Bias in Pretrained Large Language Models](https://aclanthology.org/2023.blackboxnlp-1.22/) (Prakash & Lee, BlackboxNLP 2023)