Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains

Finn Schmidt, Jan Philip Wahle, Terry Ruas, Bela Gipp


Abstract
Automatic evaluation metrics are central to the development of machine translation systems, yet their robustness under domain shift remains unclear. Most metrics are developed on the Workshop on Machine Translation (WMT) benchmarks, raising concerns about their robustness to unseen domains. Prior studies that analyze unseen domains vary translation systems, annotators, or evaluation conditions, confounding domain effects with human annotation noise. To address these biases, we introduce a systematic multi-annotator **C**ross-**D**omain **E**rror-**S**pan-**A**nnotation dataset (CD-ESA), comprising 18.8k human error span annotations across three language pairs, where we fix annotators within each language pair and evaluate translations of the same six translation systems across one seen news domain and two unseen technical domains. Using this dataset, we first find that automatic metrics appear surprisingly robust to domain shifts at the segment level (up to 0.69 agreement), but this robustness largely disappears once we account for human label variation. Averaging annotations increases inter-annotator agreement by up to +0.11. Metrics struggle on the unseen chemical domain compared to humans (inter-annotator agreement of 0.78–0.83 vs. 0.96). We recommend comparing metric–human agreement against inter-annotator agreement, rather than comparing raw metric–human agreement alone, when evaluating across different domains.
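The closing recommendation of the abstract can be illustrated with a minimal Python sketch. The abstract does not say which agreement statistic the paper uses, so the pairwise ranking agreement below is an assumption chosen for illustration, and the function names (`pairwise_agreement`, `relative_agreement`) and all scores are hypothetical rather than taken from CD-ESA. The idea is to report metric–human agreement relative to the inter-annotator agreement measured on the same segments, instead of reporting the raw value alone.

```python
from itertools import combinations


def pairwise_agreement(scores_a, scores_b):
    """Fraction of segment pairs that both score lists rank in the same order.

    Assumption: this pairwise ranking agreement stands in for whatever
    agreement statistic the paper actually uses. Pairs tied in either
    list are skipped.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(scores_a)), 2):
        diff_a = scores_a[i] - scores_a[j]
        diff_b = scores_b[i] - scores_b[j]
        if diff_a == 0 or diff_b == 0:
            continue  # a tie carries no ranking information
        comparable += 1
        if (diff_a > 0) == (diff_b > 0):
            concordant += 1
    if comparable == 0:
        raise ValueError("no comparable segment pairs")
    return concordant / comparable


def relative_agreement(metric, annotator_1, annotator_2):
    """Return (metric-human agreement, inter-annotator agreement, ratio)."""
    # Human ceiling: how well two annotators agree with each other.
    ceiling = pairwise_agreement(annotator_1, annotator_2)
    # Metric-human agreement, averaged over the individual annotators.
    metric_human = (
        pairwise_agreement(metric, annotator_1)
        + pairwise_agreement(metric, annotator_2)
    ) / 2
    return metric_human, ceiling, metric_human / ceiling


# Hypothetical toy scores for four segments in one domain.
annotator_1 = [90, 70, 50, 30]
annotator_2 = [85, 75, 40, 45]
metric = [0.9, 0.6, 0.7, 0.2]

metric_human, ceiling, ratio = relative_agreement(metric, annotator_1, annotator_2)
print(f"metric-human={metric_human:.2f} inter-annotator={ceiling:.2f} ratio={ratio:.2f}")
```

Computing the ratio per domain makes a drop in raw metric–human agreement interpretable: the same raw value can reflect a weak metric in a domain where annotators agree strongly, or a reasonable metric in a domain where annotators themselves disagree.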
Anthology ID:
2026.findings-acl.1145
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
22822–22841
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1145/
Cite (ACL):
Finn Schmidt, Jan Philip Wahle, Terry Ruas, and Bela Gipp. 2026. Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains. In Findings of the Association for Computational Linguistics: ACL 2026, pages 22822–22841, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains (Schmidt et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1145.pdf
Checklist:
 2026.findings-acl.1145.checklist.pdf