Correlating automated and human assessments of machine translation quality

Deborah Coughlin


Abstract
We describe a large-scale investigation of the correlation between human judgments of machine translation quality and the automated metrics that are increasingly used to drive progress in the field. We compare the results of 124 human evaluations of machine translated sentences to the scores generated by two automatic evaluation metrics (BLEU and NIST). When datasets are held constant or file size is sufficiently large, BLEU and NIST scores closely parallel human judgments. Surprisingly, this was true even though these scores were calculated using just one human reference. We suggest that when human evaluators are forced to make decisions without sufficient context or domain expertise, they fall back on strategies that are not unlike determining n-gram precision.
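The BLEU and NIST metrics discussed in the abstract both reduce to forms of n-gram precision against reference translations. As a rough illustration of what that computation looks like with a single human reference (the setting the abstract describes), the sketch below computes a sentence-level BLEU-style score: clipped n-gram precision for n = 1..4, combined by a geometric mean and scaled by a brevity penalty. The function name, the smoothing constant, and the example sentences are illustrative assumptions, not taken from the paper; the official BLEU and NIST implementations differ in detail (corpus-level aggregation, and information-weighted n-grams in NIST's case).

    import math
    from collections import Counter

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def sentence_bleu_single_ref(candidate, reference, max_n=4):
        # Sentence-level BLEU-style score against one reference:
        # clipped n-gram precision for n = 1..max_n, geometric mean,
        # times a brevity penalty for short candidates.
        cand, ref = candidate.split(), reference.split()
        log_precisions = []
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each candidate n-gram count by its count in the reference.
            clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total = max(sum(cand_counts.values()), 1)
            # Tiny floor keeps the geometric mean defined when a precision is zero.
            log_precisions.append(math.log(max(clipped, 1e-9) / total))
        geo_mean = math.exp(sum(log_precisions) / max_n)
        bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
        return bp * geo_mean

    # Hypothetical example pair, not from the paper's data.
    print(sentence_bleu_single_ref("the cat sat on the mat",
                                   "the cat is on the mat"))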
Anthology ID:
2003.mtsummit-papers.9
Volume:
Proceedings of Machine Translation Summit IX: Papers
Month:
September 23-27
Year:
2003
Address:
New Orleans, USA
Venue:
MTSummit
URL:
https://aclanthology.org/2003.mtsummit-papers.9
Cite (ACL):
Deborah Coughlin. 2003. Correlating automated and human assessments of machine translation quality. In Proceedings of Machine Translation Summit IX: Papers, New Orleans, USA.
Cite (Informal):
Correlating automated and human assessments of machine translation quality (Coughlin, MTSummit 2003)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/2003.mtsummit-papers.9.pdf