MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training

Anastasia Dunca, Maanas Kumar Sharma, Olivia Munoz, Victor Rosales


Abstract
Jailbreaking, the phenomenon where specific prompts cause LLMs to assist with harmful requests, remains a critical challenge in NLP, particularly in non-English and lower-resourced languages. To address this, we introduce MULBERE, a method that extends Targeted Latent Adversarial Training (T-LAT) to the multilingual setting. We first create and share a multilingual jailbreak dataset spanning high-, medium-, and low-resource languages, and then fine-tune LLaMA-2-7b-chat on T-LAT examples for jailbreak robustness interleaved with chat examples that preserve general model performance. Our evaluations show that MULBERE reduces average multilingual jailbreak success rates by 75% compared to the base LLaMA safety training and by 71% compared to English-only T-LAT, while maintaining or improving standard LLM performance.
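The core of T-LAT pairs an inner adversarial search in a model's latent space with an outer defensive update: a perturbation to hidden activations is optimized to elicit a harmful target completion, and the model is then trained to produce a refusal under that same perturbation. Below is a minimal PyTorch sketch of one such step, assuming a Hugging Face-style causal LM; the hooked layer, the epsilon bound, the step counts, and the exact loss pairing are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch of one targeted latent adversarial training (T-LAT) step.
# Assumptions (not from the paper): perturbation applied to prompt positions
# at a single decoder layer, an L-infinity bound `epsilon`, and a plain
# cross-entropy attack/defense loss pairing.
import torch
import torch.nn.functional as F

def t_lat_step(model, layer, prompt_ids, harmful_ids, refusal_ids,
               epsilon=1.0, inner_steps=6, inner_lr=0.1):
    """Inner loop: find a latent perturbation that elicits `harmful_ids`.
    Outer step: with that perturbation frozen, train toward `refusal_ids`."""
    p_len = prompt_ids.size(-1)
    dtype = next(model.parameters()).dtype
    delta = torch.zeros(1, p_len, model.config.hidden_size,
                        device=prompt_ids.device, dtype=dtype,
                        requires_grad=True)

    def hook(module, args, output):
        # Add the perturbation to this layer's hidden states (prompt only).
        out = output[0] if isinstance(output, tuple) else output
        out = out.clone()
        out[:, :p_len] = out[:, :p_len] + delta
        return (out,) + output[1:] if isinstance(output, tuple) else out

    handle = layer.register_forward_hook(hook)
    try:
        # --- Inner loop: optimize delta so the harmful continuation becomes likely.
        attack_ids = torch.cat([prompt_ids, harmful_ids], dim=-1)
        for _ in range(inner_steps):
            logits = model(attack_ids).logits
            attack_loss = F.cross_entropy(
                logits[0, p_len - 1:-1], harmful_ids[0])
            (grad,) = torch.autograd.grad(attack_loss, delta)
            with torch.no_grad():
                delta -= inner_lr * grad          # make the harmful text likely
                delta.clamp_(-epsilon, epsilon)   # keep the perturbation bounded
        # --- Outer step: under the frozen adversarial latents, train to refuse.
        delta = delta.detach()
        defend_ids = torch.cat([prompt_ids, refusal_ids], dim=-1)
        logits = model(defend_ids).logits
        defense_loss = F.cross_entropy(
            logits[0, p_len - 1:-1], refusal_ids[0])
        defense_loss.backward()  # caller applies the optimizer step
        return defense_loss
    finally:
        handle.remove()
```

In MULBERE, steps like this are interleaved with ordinary multilingual chat examples, so the model is pushed to refuse under worst-case latent perturbations without losing general chat capability.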
Anthology ID:
2025.winlp-main.27
Volume:
Proceedings of the 9th Widening NLP Workshop
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Chen Zhang, Emily Allaway, Hua Shen, Lesly Miculicich, Yinqiao Li, Meryem M'hamdi, Peerat Limkonchotiwat, Richard He Bai, Santosh T.y.s.s., Sophia Simeng Han, Surendrabikram Thapa, Wiem Ben Rim
Venues:
WiNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
175–181
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.27/
Cite (ACL):
Anastasia Dunca, Maanas Kumar Sharma, Olivia Munoz, and Victor Rosales. 2025. MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training. In Proceedings of the 9th Widening NLP Workshop, pages 175–181, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training (Dunca et al., WiNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.27.pdf