Oleksiy Syvokon
2026
Dictionary-Based Speculative Decoding for Non-Latin-Script Languages
Oleksiy Syvokon
Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)
Large language models tokenize non-Latin-script languages inefficiently: a single word in Ukrainian or Crimean Tatar is split into two to three times as many tokens as its English equivalent. We propose _dictionary-based speculative decoding_ (DictSpec), which accelerates inference by proposing draft continuations from a static n-gram lookup table built offline from an unlabeled corpus. The lookup table requires no trainable parameters or GPU resources, is inexpensive to construct, adds under 5 MB of memory overhead, and can be reused across models that share a tokenizer. We evaluate DictSpec on Ukrainian and Crimean Tatar (Cyrillic and Latin scripts), implementing a vLLM plugin to benchmark five models ranging from 3B to 70B parameters on consumer- and server-grade GPUs. In controlled emulation, DictSpec reduces verification steps by up to 1.65×, with gains correlating substantially with tokenizer fertility. In live vLLM serving, pure DictSpec gives modest speedups, while a hybrid with prompt-local n-gram speculation reaches up to 1.76×. We release our code and vLLM plugin as open source.
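To make the idea of drafting from a static n-gram lookup table concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation or the vLLM plugin: the function names (`build_table`, `propose_draft`, `accepted_prefix`), the context length `n`, the draft length `k`, and the choice of keeping only the most frequent continuation per context are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_table(corpus, n=2, k=3):
    """Map each n-token context to its most frequent k-token continuation.

    `corpus` is an iterable of token-id sequences (assumed already tokenized).
    """
    counts = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - n - k + 1):
            context = tuple(seq[i:i + n])
            continuation = tuple(seq[i + n:i + n + k])
            counts[context][continuation] += 1
    # Keep one continuation per context (a simplifying assumption).
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def propose_draft(table, prefix, n=2):
    """Return draft tokens for the last n tokens of `prefix`, or () on a miss."""
    if len(prefix) < n:
        return ()
    return table.get(tuple(prefix[-n:]), ())


def accepted_prefix(draft, target_tokens):
    """Count leading draft tokens that agree with the target model's tokens."""
    accepted = 0
    for d, t in zip(draft, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted
```

In this sketch the table is built once offline and then queried at each decoding step; the target model verifies the draft, and only the agreeing prefix is kept, so a lookup miss or a wrong draft costs nothing beyond an ordinary decoding step.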
2024
The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian
Mariana Romanyshyn | Oleksiy Syvokon | Roman Kyslyi
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024
This paper presents the results of the UNLP 2024 shared task, the first Shared Task on Fine-Tuning Large Language Models for the Ukrainian language. The goal of the task was to facilitate the creation of models that have knowledge of the Ukrainian language, history, and culture, as well as common knowledge, and are capable of generating fluent and accurate responses in Ukrainian. The participants were required to use models with open weights and reasonable size to ensure the reproducibility of the solutions. The participating systems were evaluated using multiple-choice exam questions and manually crafted open questions. Three teams submitted their solutions before the deadline, and two teams submitted papers that were accepted to appear in the UNLP workshop proceedings and are referred to in this report. The Codabench leaderboard is left open for further submissions.
2023
The UNLP 2023 Shared Task on Grammatical Error Correction for Ukrainian
Oleksiy Syvokon | Mariana Romanyshyn
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
This paper presents the results of the UNLP 2023 shared task, the first Shared Task on Grammatical Error Correction for the Ukrainian language. The task included two tracks: GEC-only and GEC+Fluency. The dataset and evaluation scripts were provided to the participants, and the final results were evaluated on a hidden test set. Six teams submitted their solutions before the deadline, and four teams submitted papers that were accepted to appear in the UNLP workshop proceedings and are referred to in this report. The CodaLab leaderboard is left open for further submissions.
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Oleksiy Syvokon | Olena Nahorna | Pavlo Kuchmiichuk | Nastasiia Osidach
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
We present a corpus professionally annotated for grammatical error correction (GEC) and fluency edits in the Ukrainian language. We have built two versions of the corpus – GEC+Fluency and GEC-only – to differentiate the corpus application. To the best of our knowledge, this is the first GEC corpus for the Ukrainian language. We collected texts with errors (33,735 sentences) from a diverse pool of contributors, including both native and non-native speakers. The data cover a wide variety of writing domains, from text chats and essays to formal writing. Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling. This corpus can be used for developing and evaluating GEC systems in Ukrainian. More generally, it can be used for researching multilingual and low-resource NLP, morphologically rich languages, document-level GEC, and fluency correction. The corpus is publicly available at https://github.com/grammarly/ua-gec