Safety Is Not Universal: The Selective Safety Trap in LLM Alignment

Iago Alves Brito; Walcy Rios; Julia Soares Dollis; Diogo Fernandes Costa Silva; Arlindo Rodrigues Galvão Filho

Abstract

Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", obscuring vulnerabilities toward specific populations. In this work, we expose the Selective Safety Trap: a systemic failure mode where models robustly defend specific populations while leaving underrepresented communities highly vulnerable to identical adversarial attacks. To systematically audit this phenomenon, we introduce MiJaBench, a bilingual (English–Portuguese) adversarial benchmark comprising 43,961 controlled jailbreaking prompts across 16 minority groups. By evaluating 14 state-of-the-art LLMs on MiJaBench, we curate 615,454 prompt-response pairs that compose MiJaBench-Align, revealing that safety alignment is not a uniform semantic capability but a demographic hierarchy, with defense rates fluctuating by up to 42% within the same model solely based on the target group. This disparity persists across architectures and languages and is amplified by scaling, indicating that current alignment methods learn group-specific safeguards rather than a generalized notion of harm. Through targeted direct preference optimization (DPO) on a 1B-parameter baseline, we achieve strong zero-shot safety generalizations to entirely unseen demographics and complex attack strategies. We release all datasets and scripts to provide the community with a concrete pathway toward equitable, transferable safety alignment.

Anthology ID: 2026.findings-acl.489
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10044–10065
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.489/
Cite (ACL): Iago Alves Brito, Walcy Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, and Arlindo Rodrigues Galvão Filho. 2026. Safety Is Not Universal: The Selective Safety Trap in LLM Alignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10044–10065, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Safety Is Not Universal: The Selective Safety Trap in LLM Alignment (Brito et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.489.pdf
Checklist: 2026.findings-acl.489.checklist.pdf
