Kyrylo Zakharov

2026

Professional Translators Versus Quality Estimation Models: Reliability and Agreement in English-Ukrainian Translation Evaluation
Dmytro Chaplynskyi | Kyrylo Zakharov | Lesia Ivashkevych
Proceedings of the Fifth Ukrainian Natural Language Processing Conference (UNLP 2026)

We extend a prior study comparing automatic Quality Estimation (QE) models with crowdsourced student judgments for English–Ukrainian parallel corpus evaluation. Eight professional translators each rate 1,000 sentence pairs on a continuous 0–100 scale under one of two paradigms: holistic quality scoring or a two-stage fluency-plus-adequacy protocol, with a repeated task for test–retest reliability. Professionals using the holistic scale achieve significantly higher inter-rater reliability than both linguistics students and professionals using separate fluency and adequacy scales, contradicting the expectation that multidimensional evaluation improves agreement. Adequacy correlates strongly with holistic judgments while fluency emerges as a largely independent dimension. Experts also exhibit a significant leniency drift over the session, alongside increasing evaluation speed. We additionally evaluate three LLMs as translation quality judges (Gemini 3 Flash, GPT-5.4, Gemma 3 27B) and find that the two larger models modestly outperform dedicated QE models in correlation with expert scores (r = 0.814–0.821 vs. r ≤ 0.747). When prompted for separate fluency and adequacy scores, the LLMs replicate the adequacy-dominance pattern, confirming that meaning preservation drives holistic quality perception across both human and machine judges.

2025

A Framework for Large-Scale Parallel Corpus Evaluation: Ensemble Quality Estimation Models Versus Human Assessment
Dmytro Chaplynskyi | Kyrylo Zakharov
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)

We developed a methodology and a framework for automatically evaluating and filtering large-scale parallel corpora for neural machine translation (NMT). We applied six modern Quality Estimation (QE) models to score 55 million English-Ukrainian sentence pairs and conducted human evaluation on a stratified sample of 9,755 pairs. Using the obtained data, we ran a thorough statistical analysis to assess the performance of selected QE models and build linear, quadratic and beta regression models on the ensemble to estimate human quality judgments from automatic metrics. Our best ensemble model explained approximately 60% of the variance in expert ratings. We also found a non-linear relationship between automatic metrics and human quality perception, indicating that automatic metrics can be used to predict the human score. Our findings will facilitate further research in parallel corpus filtering and quality estimation and ultimately contribute to higher-quality NMT systems. We are releasing our framework, the evaluated corpus with quality scores, and the human evaluation dataset to support further research in this area.

2023

Learning Word Embeddings for Ukrainian: A Comparative Study of FastText Hyperparameters
Nataliia Romanyshyn | Dmytro Chaplynskyi | Kyrylo Zakharov
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)

This study addresses the challenges of learning unsupervised word representations for the morphologically rich and low-resource Ukrainian language. Traditional models that perform decently on English do not generalize well for such languages due to a lack of sufficient data and the complexity of their grammatical structures. To overcome these challenges, we utilized a high-quality, large dataset of different genres for learning Ukrainian word vector representations. We found the best hyperparameters to train fastText language models on this dataset and performed intrinsic and extrinsic evaluations of the generated word embeddings using the established methods and metrics. The results of this study indicate that the trained vectors exhibit superior performance on intrinsic tests in comparison to existing embeddings for Ukrainian. Our best model gives 62% Accuracy on the word analogy task. Extrinsic evaluations were performed on two sequence labeling tasks: NER and POS tagging (83% spaCy NER F-score, 83% spaCy POS Accuracy, 92% Flair POS Accuracy).

Venues

UNLP (3)
WS (1)
