Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: A Case Study in Kazakh
Nurkhan Laiyk | Daniil Orel | Rituraj Joshi | Maiya Goloburda | Yuxia Wang | Preslav Nakov | Fajri Koto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entry of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
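As a rough illustration of the fine-tuning step the abstract describes, the sketch below runs supervised instruction tuning with Hugging Face TRL. The file name "kaz_ift.jsonl", the record fields, and the prompt template are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of instruction fine-tuning (SFT) on an IFT dataset like the
# one described above. "kaz_ift.jsonl", its field names, and the prompt
# template are hypothetical, not the authors' actual setup.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed record layout: {"instruction": ..., "response": ...} per JSON line.
dataset = load_dataset("json", data_files="kaz_ift.jsonl", split="train")

def to_text(example):
    # Fold instruction and response into the single "text" field SFTTrainer reads.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['response']}"
    }

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B",  # one of the base-model families named in the abstract
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen-kaz-ift",
        max_seq_length=1024,
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
)
trainer.train()
```

The same recipe would apply to the Falcon and Gemma checkpoints mentioned in the abstract by swapping the model identifier.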