Colten DiIanni

2026

MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Parker Riley | Daniel Deutsch | Mara Finkelstein | Colten DiIanni | Juraj Juraska | Markus Freitag
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To improve annotation quality, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an annotator reviews and edits a set of prior MQM annotations that may have come from themselves, another human annotator, or an automatic system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.

2025

Don’t Sweat the Small Stuff: Segment-Level Meta-Evaluation Based on Pairwise Difference Correlation
Colten DiIanni | Daniel Deutsch
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

This paper introduces Pairwise Difference Pearson (PDP), a novel segment-level meta-evaluation metric for Machine Translation (MT) that addresses limitations in previous Pearson’s 𝜌-based and Kendall’s 𝜏-based meta-evaluation approaches. PDP is a correlation-based metric that utilizes pairwise differences rather than raw scores. It draws on information from all segments for a more robust understanding of score distributions and uses only pairwise differences to refine Global Pearson to intra-segment comparisons. Analysis on the WMT’24 shared task shows PDP properly ranks sentinel evaluation metrics and better aligns with human error weightings than acc_eq.

Venues

ACL (1)
EMNLP (1)
