Understanding Large Language Model Vulnerabilities to Social Bias Attacks
Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, Mykola Pechenizkiy
Abstract
Large Language Models (LLMs) have become foundational in human-computer interaction, demonstrating remarkable linguistic capabilities across various tasks. However, there is a growing concern about their potential to perpetuate social biases present in their training data. In this paper, we comprehensively investigate the vulnerabilities of contemporary LLMs to various social bias attacks, including prefix injection, refusal suppression, and learned attack prompts. We evaluate popular models such as LLaMA-2, GPT-3.5, and GPT-4 across gender, racial, and religious bias types. Our findings reveal that models are generally more susceptible to gender bias attacks compared to racial or religious biases. We also explore novel aspects such as cross-bias and multiple-bias attacks, finding varying degrees of transferability across bias types. Additionally, our results show that larger models and pretrained base models often exhibit higher susceptibility to bias attacks. These insights contribute to the development of more inclusive and ethically responsible LLMs, emphasizing the importance of understanding and mitigating potential bias vulnerabilities. We offer recommendations for model developers and users to enhance the robustness of LLMs against social bias attacks.- Anthology ID:
- 2025.acl-long.862
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 17620–17636
- Language:
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.862/
- DOI:
- Cite (ACL):
- Jiaxu Zhao, Meng Fang, Fanghua Ye, Ke Xu, Qin Zhang, Joey Tianyi Zhou, and Mykola Pechenizkiy. 2025. Understanding Large Language Model Vulnerabilities to Social Bias Attacks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17620–17636, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- Understanding Large Language Model Vulnerabilities to Social Bias Attacks (Zhao et al., ACL 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.862.pdf