MITRA-zh-eval: Using a Buddhist Chinese Language Evaluation Dataset to Assess Machine Translation and Evaluation Metrics

Sebastian Nehrdich, Avery Chen, Marcus Bingenheimer, Lu Huang, Rouying Tang, Xiang Wei, Leijie Zhu, Kurt Keutzer


Abstract
With the advent of large language models, machine translation (MT) has become a widely used but little understood tool for accessing historical and multilingual texts. While models like GPT, Claude, and Deepseek increasingly enable translation of low-resource and ancient languages, critical questions remain about their evaluation, optimal model selection, and the value of domain-specific training and retrieval-augmented generation (RAG) setups. This study introduces a comprehensive evaluation dataset for Buddhist Chinese to English translation, comprising 2,662 bilingual data points from 32 texts selected to represent the full breadth of the Chinese Buddhist canon. We evaluate computational metrics of translation quality (BLEU, chrF, BLEURT, GEMBA) against expert annotations from five domain specialists who rated 182 machine-generated translations. Our analysis reveals that LLM-based GEMBA scoring shows the strongest correlation with human judgment, significantly outperforming traditional metrics. We then benchmark commercial models (GPT-4 Turbo, Claude 3.5, Gemini), open-source models (Gemma 2, Deepseek-r1), and a domain-specialized model (Gemma 2 Mitra) using GEMBA. Our results demonstrate that domain-specific training enables open-weights models to achieve performance competitive with commercial systems, and that retrieval-augmented generation significantly improves translation quality for the best-performing commercial models.
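To make the metric comparison concrete, the sketch below shows how a GEMBA-style direct-assessment score can be obtained from an LLM and correlated against expert ratings, in the spirit of the evaluation described in the abstract. The prompt wording, the model name, and the toy rating values are illustrative assumptions, not the authors' exact setup.

# Minimal sketch of GEMBA-style direct assessment (after Kocmi & Federmann, 2023).
# Prompt text, model choice, and the example data below are assumptions for
# illustration only; they do not reproduce the paper's exact configuration.
from openai import OpenAI
from scipy.stats import pearsonr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GEMBA_DA_PROMPT = (
    'Score the following translation from Buddhist Chinese to English on a '
    'continuous scale from 0 to 100, where 0 means "no meaning preserved" '
    'and 100 means "perfect meaning and grammar".\n\n'
    'Source: "{source}"\nTranslation: "{translation}"\nScore:'
)

def gemba_score(source: str, translation: str, model: str = "gpt-4-turbo") -> float:
    """Ask an LLM for a 0-100 quality score of one segment pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GEMBA_DA_PROMPT.format(
            source=source, translation=translation)}],
        temperature=0,
    )
    # Assumes the model replies with a bare number, as the prompt requests.
    return float(response.choices[0].message.content.strip())

# Correlating automatic scores with expert judgments, as in the paper's
# metric comparison (hypothetical values shown):
human_ratings = [4.0, 2.5, 5.0]      # expert ratings of three translations
metric_scores = [82.0, 55.0, 94.0]   # GEMBA (or BLEU/chrF/BLEURT) scores
r, p = pearsonr(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")

The same correlation step works for any of the metrics named above; only the score column changes, which is what allows the metrics to be ranked by agreement with human judgment.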
Anthology ID:
2025.nlp4dh-1.12
Volume:
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Month:
May
Year:
2025
Address:
Albuquerque, USA
Editors:
Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar
Venues:
NLP4DH | WS
Publisher:
Association for Computational Linguistics
Pages:
129–137
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.12/
Cite (ACL):
Sebastian Nehrdich, Avery Chen, Marcus Bingenheimer, Lu Huang, Rouying Tang, Xiang Wei, Leijie Zhu, and Kurt Keutzer. 2025. MITRA-zh-eval: Using a Buddhist Chinese Language Evaluation Dataset to Assess Machine Translation and Evaluation Metrics. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, pages 129–137, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
MITRA-zh-eval: Using a Buddhist Chinese Language Evaluation Dataset to Assess Machine Translation and Evaluation Metrics (Nehrdich et al., NLP4DH 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.12.pdf