Estimating the predictive Power of N-gram MT Evaluation Metrics across Language and Text Types

Bogdan Babych, Anthony Hartley, Debbie Elliott


Abstract
The use of n-gram metrics to evaluate the output of MT systems is widespread. Typically, they are used in system development, where an increase in the score is taken to represent an improvement in the output of the system. However, purchasers of MT systems or services are more concerned to know how well a score predicts the acceptability of the output to a reader-user. Moreover, they usually want to know if these predictions will hold across a range of target languages and text types. We describe an experiment involving human and automated evaluations of four MT systems across two text types and 23 language directions. It establishes that the correlation between human and automated scores is high, but that the predictive power of these scores depends crucially on target language and text type.
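The paper compares BLEU-style n-gram scores against human judgements. As a minimal sketch of the underlying mechanics only (not the paper's actual implementation), the following Python fragment computes a simple sentence-level BLEU-like score and correlates automated scores with human adequacy ratings via Pearson's r; the function names and all toy data are hypothetical illustrations.

import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU-style segment score: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: automated scores for four system outputs vs. invented human scores.
outputs = ["the cat sat on the mat", "a cat is on the mat",
           "the cat mat sat", "cat the on sat mat the"]
reference = "the cat sat on the mat"
auto_scores = [bleu(o, reference) for o in outputs]
human_scores = [5.0, 3.5, 2.0, 1.0]  # hypothetical adequacy judgements
print(pearson(auto_scores, human_scores))

The paper's point is that a high correlation of this kind, measured on one target language and text type, does not by itself license predictions about output acceptability in another.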
Anthology ID: 2005.mtsummit-posters.13
Volume: Proceedings of Machine Translation Summit X: Posters
Month: September 13–15
Year: 2005
Address: Phuket, Thailand
Venue: MTSummit
Pages: 412–418
URL: https://aclanthology.org/2005.mtsummit-posters.13
Cite (ACL): Bogdan Babych, Anthony Hartley, and Debbie Elliott. 2005. Estimating the predictive Power of N-gram MT Evaluation Metrics across Language and Text Types. In Proceedings of Machine Translation Summit X: Posters, pages 412–418, Phuket, Thailand.
Cite (Informal): Estimating the predictive Power of N-gram MT Evaluation Metrics across Language and Text Types (Babych et al., MTSummit 2005)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/2005.mtsummit-posters.13.pdf