Dimitrios Zaikis


2025

Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation into seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset comprises 206,983 triplets, each consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.
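To illustrate the triplet format and the few-shot APE setup described in the abstract, the sketch below assembles a prompt from demonstration triplets. The field names, the German example segments, and the prompt wording are assumptions for illustration, not the released dataset's schema; the resulting string would be sent to any instruction-tuned LLM's completion endpoint.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One LangMark-style APE example (field names are assumptions)."""
    source: str     # English source segment
    mt: str         # raw NMT output in the target language
    post_edit: str  # human post-edited translation

def build_ape_prompt(examples: list[Triplet], source: str, mt: str,
                     target_lang: str = "German") -> str:
    """Assemble a few-shot APE prompt: demonstrations, then the query pair."""
    parts = [f"Correct the {target_lang} machine translation of each English segment."]
    for ex in examples:
        parts.append(f"Source: {ex.source}\nMT: {ex.mt}\nPost-edit: {ex.post_edit}")
    # The model is asked to continue with the post-edit for the new pair.
    parts.append(f"Source: {source}\nMT: {mt}\nPost-edit:")
    return "\n\n".join(parts)

# Hypothetical demonstration triplet with a grammatical MT error.
demos = [
    Triplet("The meeting was postponed.",
            "Das Treffen wurde verschiebt.",
            "Das Treffen wurde verschoben."),
]
prompt = build_ape_prompt(demos, "Please sign the attached form.",
                          "Bitte unterschreiben Sie das angehängte Formular.")
print(prompt)
```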

2023

This paper presents our system for SemEval-2023 Task 4, which aims to identify the human values behind arguments by classifying whether an argument draws on a specific value category. Our approach leverages a second-phase pre-training method to adapt a RoBERTa Language Model (LM) and tackles the problem using a One-Versus-All strategy. Final predictions are determined by a majority-voting module that combines the outputs of an ensemble of three sets of per-label models. We conducted experiments to evaluate the impact of different pre-trained LMs on the task, comparing their performance in both pre-trained and task-adapted settings. Our findings show that fine-tuning the RoBERTa LM on the task-specific dataset improves its performance, outperforming the best-performing BERT baseline. Overall, our approach achieved a macro-F1 score of 0.47 on the official test set, demonstrating its potential for identifying the human values behind arguments.
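To make the voting step concrete, here is a minimal sketch of a majority-voting module that combines binary per-label predictions from three sets of One-Versus-All models. The array layout and function name are assumptions for illustration, not the system's actual implementation.

```python
import numpy as np

def majority_vote(ensemble_preds: np.ndarray) -> np.ndarray:
    """Combine binary per-label predictions from an ensemble by majority vote.

    ensemble_preds: shape (n_models, n_samples, n_labels) with 0/1 entries,
    one One-Versus-All classifier per label in each model set.
    """
    votes = ensemble_preds.sum(axis=0)  # yes-votes per (sample, label)
    # Strict majority: with 3 model sets, a label needs at least 2 votes.
    return (votes * 2 > ensemble_preds.shape[0]).astype(int)

# Three model sets, 2 samples, 4 hypothetical value categories.
preds = np.array([
    [[1, 0, 1, 0], [0, 1, 1, 0]],
    [[1, 0, 0, 0], [0, 1, 1, 1]],
    [[1, 1, 1, 0], [0, 0, 1, 1]],
])
print(majority_vote(preds))  # [[1 0 1 0] [0 1 1 1]]
```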