Understanding Large Language Model Vulnerabilities to Social Bias Attacks

Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, Mykola Pechenizkiy


Abstract
Large Language Models (LLMs) have become foundational in human-computer interaction, demonstrating remarkable linguistic capabilities across various tasks. However, there is a growing concern about their potential to perpetuate social biases present in their training data. In this paper, we comprehensively investigate the vulnerabilities of contemporary LLMs to various social bias attacks, including prefix injection, refusal suppression, and learned attack prompts. We evaluate popular models such as LLaMA-2, GPT-3.5, and GPT-4 across gender, racial, and religious bias types. Our findings reveal that models are generally more susceptible to gender bias attacks compared to racial or religious biases. We also explore novel aspects such as cross-bias and multiple-bias attacks, finding varying degrees of transferability across bias types. Additionally, our results show that larger models and pretrained base models often exhibit higher susceptibility to bias attacks. These insights contribute to the development of more inclusive and ethically responsible LLMs, emphasizing the importance of understanding and mitigating potential bias vulnerabilities. We offer recommendations for model developers and users to enhance the robustness of LLMs against social bias attacks.
Anthology ID:
2025.acl-long.862
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
17620–17636
Language:
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.862/
DOI:
Bibkey:
Cite (ACL):
Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, and Mykola Pechenizkiy. 2025. Understanding Large Language Model Vulnerabilities to Social Bias Attacks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17620–17636, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Understanding Large Language Model Vulnerabilities to Social Bias Attacks (Zhao et al., ACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.862.pdf