Oleksiy Syvokon


2026

Large language models tokenize non-Latin-script languages inefficiently: a single word in Ukrainian or Crimean Tatar is split into two to three times as many tokens as its English equivalent. We propose _dictionary-based speculative decoding_ (DictSpec), which accelerates inference by proposing draft continuations from a static n-gram lookup table built offline from an unlabeled corpus. The lookup table requires no trainable parameters or GPU resources, is inexpensive to construct, adds under 5 MB of memory overhead, and can be reused across models that share a tokenizer. We evaluate DictSpec on Ukrainian and Crimean Tatar (Cyrillic and Latin scripts), implementing a vLLM plugin to benchmark five models ranging from 3B to 70B parameters on consumer- and server-grade GPUs. In controlled emulation, DictSpec reduces verification steps by up to 1.65×, with gains correlating substantially with tokenizer fertility. In live vLLM serving, pure DictSpec gives modest speedups, while a hybrid with prompt-local n-gram speculation reaches up to 1.76×. We release our code and vLLM plugin as open source.
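The idea of drafting from a static n-gram lookup table can be illustrated with a minimal sketch. This is not the authors' implementation or their vLLM plugin: the function names, the context length, the draft length, and the use of plain integer token IDs are all assumptions made for illustration. The table maps a short token context to its most frequent continuation in a corpus, and a draft counts as useful only for the prefix the target model agrees with.

```python
from collections import Counter, defaultdict


def build_table(corpus, context_len=2, draft_len=3):
    """Map each context of `context_len` tokens to its most frequent
    `draft_len`-token continuation (hypothetical table layout)."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - context_len - draft_len + 1):
            ctx = tuple(seq[i:i + context_len])
            cont = tuple(seq[i + context_len:i + context_len + draft_len])
            counts[ctx][cont] += 1
    # Keep only the top continuation per context to keep the table small.
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def propose_draft(table, generated, context_len=2):
    """Look up the last `context_len` generated tokens; return a draft or []."""
    if len(generated) < context_len:
        return []
    return list(table.get(tuple(generated[-context_len:]), ()))


def count_accepted(draft, target_tokens):
    """Count the leading draft tokens that match what the target model
    would have produced (stand-in for the verification step)."""
    accepted = 0
    for drafted, expected in zip(draft, target_tokens):
        if drafted != expected:
            break
        accepted += 1
    return accepted


# Toy corpus of token-ID sequences (made-up values, not real tokenizer output).
corpus = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 6], [7, 1, 2, 3, 4, 5]]
table = build_table(corpus)
draft = propose_draft(table, [9, 1, 2])
accepted = count_accepted(draft, [3, 4, 8])
```

In this toy run the context `(1, 2)` is followed by `(3, 4, 5)` twice and by `(3, 4, 6)` once, so the draft is `[3, 4, 5]`, and two of its three tokens are accepted against the assumed target output `[3, 4, 8]`. In a real serving stack the target model would verify all draft tokens in a single forward pass, which is where any speedup would come from.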

2024

This paper presents the results of the UNLP 2024 shared task, the first Shared Task on Fine-Tuning Large Language Models for the Ukrainian language. The goal of the task was to facilitate the creation of models that have knowledge of the Ukrainian language, history, and culture, as well as common knowledge, and are capable of generating fluent and accurate responses in Ukrainian. The participants were required to use models with open weights and reasonable size to ensure the reproducibility of the solutions. The participating systems were evaluated using multiple-choice exam questions and manually crafted open questions. Three teams submitted their solutions before the deadline, and two teams submitted papers that were accepted to appear in the UNLP workshop proceedings and are referred to in this report. The Codabench leaderboard is left open for further submissions.

2023

This paper presents the results of the UNLP 2023 shared task, the first Shared Task on Grammatical Error Correction for the Ukrainian language. The task included two tracks: GEC-only and GEC+Fluency. The dataset and evaluation scripts were provided to the participants, and the final results were evaluated on a hidden test set. Six teams submitted their solutions before the deadline, and four teams submitted papers that were accepted to appear in the UNLP workshop proceedings and are referred to in this report. The CodaLab leaderboard is left open for further submissions.
We present a corpus professionally annotated for grammatical error correction (GEC) and fluency edits in the Ukrainian language. We have built two versions of the corpus – GEC+Fluency and GEC-only – to differentiate the corpus application. To the best of our knowledge, this is the first GEC corpus for the Ukrainian language. We collected texts with errors (33,735 sentences) from a diverse pool of contributors, including both native and non-native speakers. The data cover a wide variety of writing domains, from text chats and essays to formal writing. Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling. This corpus can be used for developing and evaluating GEC systems in Ukrainian. More generally, it can be used for researching multilingual and low-resource NLP, morphologically rich languages, document-level GEC, and fluency correction. The corpus is publicly available at https://github.com/grammarly/ua-gec