Abstract
Performance of NMT systems has been proven to depend on the quality of the training data. In this paper we explore different open-source tools that can be used to score the quality of translation pairs, with the goal of obtaining clean corpora for training NMT models. We measure the performance of these tools by correlating their scores with human scores, as well as rank models trained on the resulting filtered datasets in terms of their performance on different test sets and MT performance metrics.
- Anthology ID:
- 2021.mtsummit-up.9
- Volume:
- Proceedings of Machine Translation Summit XVIII: Users and Providers Track
- Month:
- August
- Year:
- 2021
- Address:
- Virtual
- Editors:
- Janice Campbell, Ben Huyck, Stephen Larocca, Jay Marciano, Konstantin Savenkov, Alex Yanishevsky
- Venue:
- MTSummit
- Publisher:
- Association for Machine Translation in the Americas
- Pages:
- 89–97
- URL:
- https://aclanthology.org/2021.mtsummit-up.9
- Cite (ACL):
- Fred Bane and Anna Zaretskaya. 2021. Selecting the best data filtering method for NMT training. In Proceedings of Machine Translation Summit XVIII: Users and Providers Track, pages 89–97, Virtual. Association for Machine Translation in the Americas.
- Cite (Informal):
- Selecting the best data filtering method for NMT training (Bane & Zaretskaya, MTSummit 2021)
- PDF:
- https://preview.aclanthology.org/teach-a-man-to-fish/2021.mtsummit-up.9.pdf