Aida Kasieva
2025
The Kyrgyz Seed Dataset Submission to the WMT25 Open Language Data Initiative Shared Task
Murat Jumashev | Alina Tillabaeva | Aida Kasieva | Turgunbek Omurkanov | Akylai Musaeva | Meerim Emil Kyzy | Gulaiym Chagataeva | Jonathan Washington
Proceedings of the Tenth Conference on Machine Translation
We present a Kyrgyz language seed dataset as part of our contribution to the WMT25 Open Language Data Initiative (OLDI) shared task. This paper details the process of collecting and curating English–Kyrgyz translations, highlighting the main challenges encountered in translating into a morphologically rich, low-resource language. We demonstrate the quality of the dataset through fine-tuning experiments, showing consistent improvements in machine translation performance across multiple models. Comparisons with bilingual and multilingual neural machine translation (MNMT) Kyrgyz–English baselines reveal that, for some models, our dataset enables performance surpassing pretrained baselines in both the English–Kyrgyz and Kyrgyz–English translation directions. These results validate the dataset’s utility and suggest that it can serve as a valuable resource for the Kyrgyz MT community and for related low-resource languages.
2024
Strategies for the Annotation of Pronominalised Locatives in Turkic Universal Dependency Treebanks
Jonathan Washington | Çağrı Çöltekin | Furkan Akkurt | Bermet Chontaeva | Soudabeh Eslami | Gulnura Jumalieva | Aida Kasieva | Aslı Kuzgun | Büşra Marşan | Chihiro Taguchi
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
As part of our efforts to develop unified Universal Dependencies (UD) guidelines for Turkic languages, we evaluate multiple approaches to a difficult morphosyntactic phenomenon: pronominal locative expressions formed with the suffix -ki. These forms result in multiple syntactic words that may carry conflicting morphological features and participate in different dependency relations. We describe multiple approaches to the problem in current (and upcoming) Turkic UD treebanks, and show that none of them offers a solution that satisfies all of the constraints we consider (including constraints imposed by UD guidelines). This calls for a compromise with the ‘least damage’ that should be adopted by most, if not all, Turkic treebanks. Our discussion of the phenomenon and various annotation approaches may also help treebanking efforts for other languages or language families with similar constructions.