Data and Model Centric Approaches for Expansion of Large Language Models to New languages
Anoop Kunchukuttan, Raj Dabre, Rudra Murthy, Mohammed Safi Ur Rahman Khan, Thanmay Jayakumar
Abstract
Despite the increasing pace of Large Language Model (LLM) research, the vast majority of existing LLMs mainly support English alongside a handful of high-resource languages, leaving a major gap for most low-resource languages. In this tutorial, we focus on approaches to expand the language coverage of LLMs. Expanding an existing model, instead of training one from scratch, provides an efficient and viable path to bring LLM technologies to low-resource languages. We look at approaches at various stages of the LLM training pipeline, such as tokenizer training, pre-training, instruction tuning, alignment, and evaluation, where adaptations are made to support new languages. We look at data-oriented approaches as well as model-oriented approaches. We hope that our tutorial enables researchers and practitioners to work on incorporating additional languages and tasks into existing LLMs to enhance inclusivity and coverage.
- Anthology ID:
- 2025.emnlp-tutorials.5
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Valentina Pyatkin, Andreas Vlachos
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 12–13
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-tutorials.5/
- Cite (ACL):
- Anoop Kunchukuttan, Raj Dabre, Rudra Murthy, Mohammed Safi Ur Rahman Khan, and Thanmay Jayakumar. 2025. Data and Model Centric Approaches for Expansion of Large Language Models to New languages. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 12–13, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Data and Model Centric Approaches for Expansion of Large Language Models to New languages (Kunchukuttan et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-tutorials.5.pdf