Context Is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation

Ahrii Kim


Abstract
As sentence-level performance in modern Machine Translation (MT) has plateaued, reliable document-level evaluation is increasingly needed. While the recent FALCON framework with pragmatic features offers a promising direction, its reliability and reproducibility are unclear. We address this gap through human evaluation, analyzing sources of low inter-annotator agreement and identifying key factors. Based on these findings, we introduce H-FALCON, a Human-centered refinement of FALCON. Our experiments show that, even with limited annotator consensus, FALCON achieves correlations comparable to or better than standard sentence-level protocols. Furthermore, we find that contextual information is inherent in all sentences, challenging the view that only some require it. This suggests that prior estimates such as “n% of sentences require context” may stem from methodological artifacts. At the same time, we show that while context is pervasive, not all of it directly influences human judgment.
Anthology ID:
2025.wmt-1.5
Volume:
Proceedings of the Tenth Conference on Machine Translation
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
81–97
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.5/
Cite (ACL):
Ahrii Kim. 2025. Context Is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation. In Proceedings of the Tenth Conference on Machine Translation, pages 81–97, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Context Is Ubiquitous, but Rarely Changes Judgments: Revisiting Document-Level MT Evaluation (Kim, WMT 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.wmt-1.5.pdf