Maksym-Yurii Rudko
2026
Data-Efficient Adaptation of Multilingual LLMs to Ukrainian
Yurii Paniv | Bohdan Didenko | Mykola Haltiuk | Vladyslav Humennyy | Andrian Kravchenko | Roman Kyslyi | Viktoriia Makovska | Artem Orlovskyi | Bohdan Ruban | Maksym-Yurii Rudko | Anastasiia Senyk | Nazarii Drushchak | Dmytro Chaplynskyi | Mariana Romanyshyn
Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
Adapting large language models to low-resource languages presents three interconnected challenges: inefficient tokenization, scarcity of high-quality annotated data, and limited resources for instruction tuning. We present a reproducible approach that addresses each challenge with data-centric methods that rely primarily on unlabeled text corpora, parallel translation data, and a multilingual base model. Our approach combines (1) vocabulary surgery for tokenizer adaptation without full retraining, (2) cross-lingual transfer of quality classifiers via translation, which enables filtering without target-language annotations, and (3) generation of instruction data through translation, task conversion, and targeted synthesis. We validate this recipe by adapting Gemma-3-12B to Ukrainian. Our pretrained model achieves top performance on Ukrainian benchmarks, and our instruction-tuned variant demonstrates strong performance on translation (33 BLEU on FLORES), summarization, and question-answering tasks, while requiring 1.5x fewer tokens than the original model for the same text. We release all models, datasets, classifiers, and code to enable replication for other languages.
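The "1.5x fewer tokens" figure can be read as a ratio of token counts produced by two tokenizers over the same text. The following is a minimal sketch of that calculation, not the authors' evaluation code. The function names are hypothetical, and the two toy tokenizers are illustrative stand-ins for the original and adapted tokenizers, whose measurement setup is not described here.

```python
from typing import Callable, Iterable, List

# A tokenizer is modelled as any callable mapping text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def total_tokens(tokenize: Tokenizer, texts: Iterable[str]) -> int:
    """Count the tokens a tokenizer produces over a collection of texts."""
    return sum(len(tokenize(text)) for text in texts)


def token_reduction_factor(
    base: Tokenizer, adapted: Tokenizer, texts: Iterable[str]
) -> float:
    """Return base_tokens / adapted_tokens over the same texts.

    A value of 1.5 would mean the adapted tokenizer needs 1.5x fewer tokens.
    """
    texts = list(texts)  # materialize so both tokenizers see identical input
    adapted_count = total_tokens(adapted, texts)
    if adapted_count == 0:
        raise ValueError("adapted tokenizer produced no tokens")
    return total_tokens(base, texts) / adapted_count


# Toy stand-ins (assumptions for illustration only): a character-level
# tokenizer plays the inefficient base, a whitespace tokenizer the adapted one.
def char_tokenizer(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    return text.split()


sample = ["мова модель", "текст даних"]
factor = token_reduction_factor(char_tokenizer, word_tokenizer, sample)
print(f"reduction factor: {factor:.2f}")
```

In practice, each callable would wrap the encode step of a real tokenizer, and the sample would be a held-out Ukrainian corpus large enough for the ratio to be stable.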