Code-Switching as a Safety Failure Mode in Large Language Models: An Empirical Study of Roman Urdu across English, Mixed, and Transliteration-Only Inputs
Waleed Jamil | Saima Rafi
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, 2026
Large Language Models exhibit robust safety alignment when harmful intent is expressed in English, yet their resilience to code-switching and transliteration remains underexplored. This paper presents the first targeted investigation of code-switching as a safety failure mode, focusing on Roman Urdu, a widely used transliterated form of Urdu common in informal and emotionally expressive communication. We introduce the Roman Urdu Adversarial Benchmark (RUAB), a semantically controlled evaluation suite designed to isolate linguistic variation from intent across four safety-critical categories: passive suicidal ideation, psychological distress, threat or intimidation, and coercion or emotional manipulation. Evaluating seven state-of-the-art models, we find that safety detection degrades consistently on code-switched and transliterated inputs, with the most pronounced failures occurring for passive suicidal ideation. Instruction-tuned and reasoning-capable models demonstrate greater robustness, suggesting these failures reflect alignment gaps rather than inherent model limitations. Our findings highlight transliteration and code-switching as under-recognized safety risks and motivate the development of linguistically inclusive, transliteration-aware safety methods.