Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation

Eric R. Bennett, HyoJung Han, Xinchen Yang, Andrew Schonebaum, Marine Carpuat


Abstract
Evaluation metrics are an important driver of progress in Machine Translation (MT), but they have been primarily validated on high-resource modern languages. In this paper, we conduct an empirical evaluation of metrics commonly used to evaluate MT from Ancient Chinese into English. Using LLMs, we construct a contrastive test set, pairing high-quality MT and purposefully flawed MT of the same Pre-Qin texts. We then evaluate the ability of each metric to discriminate between accurate and flawed translations.
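The discrimination test described in the abstract can be illustrated with a pairwise accuracy: the fraction of contrastive pairs in which a metric scores the accurate translation above the flawed one. The sketch below is not the authors' code. It assumes a reference-based setting, uses a toy unigram-overlap F1 as a stand-in for a real metric, and the names `unigram_f1`, `pairwise_accuracy`, and `EXAMPLE_TRIPLES` as well as the example sentences are hypothetical and not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

# (reference, accurate translation, flawed translation).
# Invented illustrative data, not drawn from the paper's test set.
Triple = Tuple[str, str, str]

EXAMPLE_TRIPLES: Sequence[Triple] = [
    ("the master said learning is a joy",
     "the master said learning is a joy",
     "the servant said forgetting is a burden"),
    ("he governs by virtue",
     "he rules through moral virtue",   # accurate paraphrase, low overlap
     "he governs by force"),            # meaning error, high overlap
]


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric: F1 over unique lowercase whitespace tokens."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pairwise_accuracy(
    metric: Callable[[str, str], float],
    triples: Sequence[Triple],
) -> float:
    """Share of triples where the metric ranks the accurate output higher."""
    if not triples:
        raise ValueError("need at least one contrastive triple")
    wins = sum(
        1
        for reference, accurate, flawed in triples
        if metric(accurate, reference) > metric(flawed, reference)
    )
    return wins / len(triples)


if __name__ == "__main__":
    print(pairwise_accuracy(unigram_f1, EXAMPLE_TRIPLES))
```

In the second triple the flawed output shares more surface tokens with the reference than the accurate paraphrase does, so the overlap metric ranks it higher. This is the kind of failure a contrastive test set is meant to expose.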
Anthology ID:
2025.alp-1.9
Volume:
Proceedings of the Second Workshop on Ancient Language Processing
Month:
May
Year:
2025
Address:
The Albuquerque Convention Center, Laguna
Editors:
Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti, Rachele Sprugnoli
Venues:
ALP | WS
Publisher:
Association for Computational Linguistics
Pages:
71–76
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.alp-1.9/
Cite (ACL):
Eric R. Bennett, HyoJung Han, Xinchen Yang, Andrew Schonebaum, and Marine Carpuat. 2025. Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation. In Proceedings of the Second Workshop on Ancient Language Processing, pages 71–76, The Albuquerque Convention Center, Laguna. Association for Computational Linguistics.
Cite (Informal):
Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation (Bennett et al., ALP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.alp-1.9.pdf