Turgunbek Omurkanov
2025
The Kyrgyz Seed Dataset Submission to the WMT25 Open Language Data Initiative Shared Task
Murat Jumashev, Alina Tillabaeva, Aida Kasieva, Turgunbek Omurkanov, Akylai Musaeva, Meerim Emil Kyzy, Gulaiym Chagataeva, Jonathan Washington
Proceedings of the Tenth Conference on Machine Translation
We present a Kyrgyz language seed dataset as part of our contribution to the WMT25 Open Language Data Initiative (OLDI) shared task. This paper details the process of collecting and curating English–Kyrgyz translations, highlighting the main challenges encountered in translating into a morphologically rich, low-resource language. We demonstrate the quality of the dataset through fine-tuning experiments, showing consistent improvements in machine translation performance across multiple models. Comparisons with bilingual and multilingual neural machine translation (MNMT) Kyrgyz–English baselines show that, for some models, fine-tuning on our dataset yields performance that surpasses the pretrained baselines in both the English–Kyrgyz and Kyrgyz–English directions. These results validate the dataset’s utility and suggest that it can serve as a valuable resource for the Kyrgyz MT community and for work on other related low-resource languages.