MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety

Yahan Yang, Soham Dan, Shuo Li, Dan Roth, Insup Lee


Abstract
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we introduce a multilingual guardrail with reasoning for prompt classification. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-based Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail, MrGuard, consistently outperforms recent baselines across both in-domain and out-of-domain languages by more than 15%. We also evaluate MrGuard’s robustness to multilingual variations, such as code-switching and low-resource language distractors in the prompt, and demonstrate that it preserves safety judgments under these challenging conditions. The multilingual reasoning capability of our guardrail enables it to generate explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.
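The abstract's third stage relies on Group Relative Policy Optimization (GRPO). As a generic illustration (not the authors' implementation), GRPO's key idea is to score a group of sampled completions for the same prompt and normalize each reward against the group's mean and standard deviation, replacing a learned value baseline. A minimal sketch of that group-relative advantage computation:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled completion's reward against its group.

    GRPO uses the group mean/std of rewards as the baseline, so no
    separate value network is needed. `eps` guards against division
    by zero when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: binary safety-judgment rewards for four
# sampled responses to the same multilingual prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions rewarded above the group average receive positive advantages and are reinforced; those below are suppressed. The curriculum aspect described in the abstract (ordering training prompts by difficulty) sits outside this core computation.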
Anthology ID:
2025.emnlp-main.1392
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
27365–27384
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1392/
Cite (ACL):
Yahan Yang, Soham Dan, Shuo Li, Dan Roth, and Insup Lee. 2025. MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27365–27384, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety (Yang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1392.pdf
Checklist:
 2025.emnlp-main.1392.checklist.pdf