Data-Efficient Adaptation of Multilingual LLMs to Ukrainian
Yurii Paniv, Bohdan Didenko, Mykola Haltiuk, Vladyslav Humennyy, Andrian Kravchenko, Roman Kyslyi, Viktoriia Makovska, Artem Orlovskyi, Bohdan Ruban, Maksym-Yurii Rudko, Anastasiia Senyk, Nazarii Drushchak, Dmytro Chaplynskyi, Mariana Romanyshyn
Abstract
Adapting large language models to low-resource languages presents three interconnected challenges: inefficient tokenization, scarcity of high-quality annotated data, and limited resources for instruction tuning. We present a reproducible approach that addresses each challenge using data-centric methods that primarily rely on unlabeled text corpora, parallel translation data, and a multilingual base model. Our approach combines (1) vocabulary surgery for tokenizer adaptation without full retraining, (2) cross-lingual transfer of quality classifiers via translation, enabling filtering without target-language annotations, and (3) generation of instruction data through translation, task conversion, and targeted synthesis. We validate this recipe by adapting Gemma-3-12B to Ukrainian. Our pretrained model achieves top performance on Ukrainian benchmarks, and our instruction-tuned variant demonstrates strong performance on translation (33 BLEU on FLORES), summarization, and question-answering tasks while requiring 1.5x fewer tokens than the original model for the same text. We release all models, datasets, classifiers, and code to enable replication for other languages.
- Anthology ID:
- 2026.unlp-1.14
- Volume:
- Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
- Month:
- May
- Year:
- 2026
- Address:
- Lviv, Ukraine
- Editor:
- Mariana Romanyshyn
- Venue:
- UNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 155–168
- URL:
- https://preview.aclanthology.org/bulk-corrections-2026-07-02/2026.unlp-1.14/
- Cite (ACL):
- Yurii Paniv, Bohdan Didenko, Mykola Haltiuk, Vladyslav Humennyy, Andrian Kravchenko, Roman Kyslyi, Viktoriia Makovska, Artem Orlovskyi, Bohdan Ruban, Maksym-Yurii Rudko, Anastasiia Senyk, Nazarii Drushchak, Dmytro Chaplynskyi, and Mariana Romanyshyn. 2026. Data-Efficient Adaptation of Multilingual LLMs to Ukrainian. In Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026), pages 155–168, Lviv, Ukraine. Association for Computational Linguistics.
- Cite (Informal):
- Data-Efficient Adaptation of Multilingual LLMs to Ukrainian (Paniv et al., UNLP 2026)
- PDF:
- https://preview.aclanthology.org/bulk-corrections-2026-07-02/2026.unlp-1.14.pdf