Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

Rishabh Adiga, Besmira Nushi, Varun Chandrasekaran


Abstract
We believe that analyzing attention is crucial for understanding bias in large language models (LLMs): in ambiguous comparative prompting frameworks, it provides insight into how an LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the “entity preference” of an LLM. We then propose ATLAS, a technique that localizes bias to specific layers of the LLM by analyzing attention scores and then reduces bias by scaling attention in those layers. To evaluate our method, we conduct extensive experiments across 3 datasets, 4 models, and 4 baseline approaches. Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third of the model. We also show that ATLAS effectively mitigates bias through targeted interventions without compromising downstream performance: the intervention increases perplexity by only 0.34% on average while improving the bias score by an average of 0.28 points across all datasets.
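As a reading aid, here is a minimal NumPy sketch of the pipeline the abstract describes: quantify entity preference from attention, localize the most skewed layers, and scale attention in those layers. It is not the authors' implementation; the preference formula (normalized difference of attention mass on the two entities), the top-third localization rule, and the scaling factor `alpha` are illustrative assumptions, and only the overall three steps come from the abstract.

```python
# Minimal sketch (NOT the paper's code) of the ATLAS pipeline from the abstract.
import numpy as np

def entity_preference(attn, idx_a, idx_b, eps=1e-9):
    """Per-layer preference in [-1, 1]; positive means the decision token
    attends more to entity A than to entity B.

    attn:  (num_layers, seq_len) attention from the final (decision) token
           to each prompt token, averaged over heads.
    idx_a, idx_b: token positions of the two entities in the prompt.
    """
    mass_a = attn[:, idx_a].sum(axis=1)
    mass_b = attn[:, idx_b].sum(axis=1)
    return (mass_a - mass_b) / (mass_a + mass_b + eps)

def localize_biased_layers(pref, top_frac=1 / 3):
    """Pick the layers with the largest absolute preference; the paper
    reports bias concentrating in roughly the last third of the layers."""
    k = max(1, int(len(pref) * top_frac))
    return np.argsort(np.abs(pref))[-k:]

def scale_entity_attention(attn_layer, favored_idx, alpha=0.8):
    """In a biased layer, shrink attention on the favored entity's tokens
    by `alpha` (a hypothetical factor) and renormalize."""
    out = attn_layer.copy()
    out[favored_idx] *= alpha
    return out / out.sum()

# Toy usage: 32 layers, a 20-token prompt, entities at positions {5,6} and {12,13}.
rng = np.random.default_rng(0)
attn = rng.random((32, 20))
attn /= attn.sum(axis=1, keepdims=True)  # each row is an attention distribution
pref = entity_preference(attn, [5, 6], [12, 13])
for layer in localize_biased_layers(pref):
    favored = [5, 6] if pref[layer] > 0 else [12, 13]
    attn[layer] = scale_entity_attention(attn[layer], favored)
```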
Anthology ID:
2025.acl-long.1281
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
26403–26423
URL:
https://preview.aclanthology.org/landing_page/2025.acl-long.1281/
Cite (ACL):
Rishabh Adiga, Besmira Nushi, and Varun Chandrasekaran. 2025. Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26403–26423, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models (Adiga et al., ACL 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.acl-long.1281.pdf