Selecting the best data filtering method for NMT training

Fred Bane, Anna Zaretskaya


Abstract
The performance of NMT systems has been shown to depend on the quality of the training data. In this paper we explore different open-source tools that can be used to score the quality of translation pairs, with the goal of obtaining clean corpora for training NMT models. We measure the performance of these tools by correlating their scores with human scores, and by ranking models trained on the resulting filtered datasets according to their performance on different test sets and MT performance metrics.
Anthology ID:
2021.mtsummit-up.9
Volume:
Proceedings of Machine Translation Summit XVIII: Users and Providers Track
Month:
August
Year:
2021
Address:
Virtual
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
89–97
URL:
https://aclanthology.org/2021.mtsummit-up.9
Cite (ACL):
Fred Bane and Anna Zaretskaya. 2021. Selecting the best data filtering method for NMT training. In Proceedings of Machine Translation Summit XVIII: Users and Providers Track, pages 89–97, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Selecting the best data filtering method for NMT training (Bane & Zaretskaya, MTSummit 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.mtsummit-up.9.pdf