M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

Rishabh Maheshwary, Vikas Yadav, Hoang H Nguyen, Khyati Mahajan, Sathwik Tejaswi Madhusudhan


Abstract
Collecting instruction fine-tuning (IFT) data is a resource- and time-intensive task, especially in multilingual settings where finding proficient native speakers is challenging. Moreover, traditional data collection is prone to privacy risks and toxicity, and lacks scalability. While fully synthetic datasets are a promising alternative, research on their use in the multilingual domain is limited, as existing approaches still rely on machine translation to improve multilingual performance. To bridge this gap, we introduce M2Lingual, the first fully synthetic, multi-turn multilingual dataset, comprising 175K conversations across 70 languages with a balanced mix of high-, mid-, and low-resource languages. M2Lingual is constructed using a cost-efficient and scalable method that applies our novel two-step Evol prompt taxonomy to transform a small set of human-written instructions into complex and challenging conversations. Results across three model families, six baseline datasets, and evaluations spanning 31 languages demonstrate the effectiveness of M2Lingual over other datasets.
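The abstract describes a two-step pipeline: seed instructions are first evolved into more complex variants, which are then extended into multi-turn conversations. A minimal Python sketch of that flow is shown below; the prompt templates, the `generate` stub, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a two-step Evol-style pipeline: (1) evolve a
# seed instruction into a harder variant, (2) grow it into a multi-turn
# conversation. `generate` stands in for a real LLM API call.

EVOL_TEMPLATE = (
    "Rewrite the instruction below to be more complex and challenging, "
    "while keeping it answerable:\n{instruction}"
)
FOLLOW_UP_TEMPLATE = (
    "Given this exchange, write a natural follow-up user question:\n"
    "User: {instruction}\nAssistant: {answer}"
)

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (replace with a real API request)."""
    return f"<llm output for: {prompt[:40]}...>"

def evolve_to_conversation(seed: str, turns: int = 2) -> list[dict]:
    """Step 1: evolve the seed instruction; Step 2: build `turns` exchanges."""
    instruction = generate(EVOL_TEMPLATE.format(instruction=seed))
    conversation = []
    for _ in range(turns):
        answer = generate(instruction)
        conversation.append({"user": instruction, "assistant": answer})
        # Derive the next user turn from the exchange so far.
        instruction = generate(
            FOLLOW_UP_TEMPLATE.format(instruction=instruction, answer=answer)
        )
    return conversation
```

In a real pipeline the stub would be replaced by model calls, and the Evol step would draw from a taxonomy of transformation prompts rather than a single template.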
Anthology ID:
2025.naacl-long.489
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
9676–9713
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.489/
Cite (ACL):
Rishabh Maheshwary, Vikas Yadav, Hoang H Nguyen, Khyati Mahajan, and Sathwik Tejaswi Madhusudhan. 2025. M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 9676–9713, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models (Maheshwary et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.489.pdf