MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training
Anastasia Dunca, Maanas Kumar Sharma, Olivia Munoz, Victor Rosales
Abstract
Jailbreaking, the phenomenon where specific prompts cause LLMs to assist with harmful requests, remains a critical challenge in NLP, particularly in non-English and lower-resourced languages. To address this, we introduce MULBERE, a method that extends Targeted Latent Adversarial Training (T-LAT) to a multilingual context. We first create and share a multilingual jailbreak dataset spanning high-, medium-, and low-resource languages. We then fine-tune LLaMA-2-7b-chat, interleaving T-LAT for jailbreak robustness with chat examples for model performance. Our evaluations show that MULBERE reduces average multilingual jailbreak success rates by 75% compared to the base LLaMA safety training and by 71% compared to English-only T-LAT, while maintaining or improving standard LLM performance.
- Anthology ID: 2025.winlp-main.27
- Volume: Proceedings of the 9th Widening NLP Workshop
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Chen Zhang, Emily Allaway, Hua Shen, Lesly Miculicich, Yinqiao Li, Meryem M'hamdi, Peerat Limkonchotiwat, Richard He Bai, Santosh T.y.s.s., Sophia Simeng Han, Surendrabikram Thapa, Wiem Ben Rim
- Venues: WiNLP | WS
- Publisher: Association for Computational Linguistics
- Pages: 175–181
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.27/
- Cite (ACL): Anastasia Dunca, Maanas Kumar Sharma, Olivia Munoz, and Victor Rosales. 2025. MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training. In Proceedings of the 9th Widening NLP Workshop, pages 175–181, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training (Dunca et al., WiNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.27.pdf
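
The abstract reports reductions in the average multilingual jailbreak success rate. As a reading aid, the sketch below shows one way such a figure could be computed. It assumes the reduction is relative to the baseline and that the average is an unweighted mean over language groups; the abstract does not confirm either assumption. The function names and all numbers are hypothetical placeholders, not the authors' code or results.

```python
from statistics import mean


def average_success_rate(rates_by_language):
    """Unweighted mean jailbreak success rate over languages (values in [0, 1])."""
    if not rates_by_language:
        raise ValueError("need at least one language")
    return mean(rates_by_language.values())


def relative_reduction(baseline, defended):
    """Fraction by which the average success rate drops relative to the baseline."""
    base_avg = average_success_rate(baseline)
    if base_avg == 0:
        raise ValueError("baseline average is zero; relative reduction is undefined")
    return (base_avg - average_success_rate(defended)) / base_avg


# Hypothetical placeholder rates, NOT results from the paper.
baseline = {"high_resource": 0.25, "medium_resource": 0.5, "low_resource": 0.75}
defended = {"high_resource": 0.125, "medium_resource": 0.25, "low_resource": 0.375}

reduction = relative_reduction(baseline, defended)
print(f"relative reduction: {reduction:.0%}")
```

If the paper instead reports an absolute drop in percentage points, the division by the baseline average would be omitted.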