Evaluating Corpus Cleanup Methods in the WMT’22 News Translation Task

Marilena Malli, George Tambouratzis


Abstract
This submission to the WMT22 General MT Task consists of translations produced by a series of NMT models for the following two language pairs: German-to-English and German-to-French. All models were trained using only the parallel training data specified by WMT22; no monolingual training data was used. The models follow the Transformer architecture, employing 8 attention heads and 6 layers in both the encoder and the decoder. It is also worth mentioning that, in order to limit the computational resources used during training, we trained the majority of models for at most 21 epochs. The translations submitted to WMT22 were produced using the test data released by WMT22. The aim of our experiments has been to evaluate methods for cleaning up a parallel corpus, to determine whether this leads to a translation model producing more accurate translations. For each language pair, the base NMT model was trained on the raw parallel training corpora, while the additional NMT models were trained on corpora subjected to a cleaning process using the following tools: Bifixer and Bicleaner. It should be mentioned that the Bicleaner repository does not provide pre-trained classifiers for the above language pairs; consequently, we trained probabilistic dictionaries in order to produce new models. The fundamental differences between the resulting NMT models relate mainly to the quality and quantity of the training data, with very few differences in the training parameters. To complete this work, we used the following three software packages: (i) Marian NMT (version v1.11.5), which was used for training the neural machine translation models, and (ii) Bifixer and (iii) Bicleaner, which were used to correct and clean the parallel training data.
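The training setup described above can be sketched as a Marian NMT configuration file. This is a minimal illustrative fragment, not the authors' actual configuration: the file names (corpus, vocabulary, model paths) are hypothetical, and only the options stated in the abstract (Transformer, 6 encoder/decoder layers, 8 attention heads, a 21-epoch limit) are taken from the text.

```yaml
# config.de-en.yml -- hypothetical Marian training config (file paths are placeholders)
model: model.de-en.npz
type: transformer          # Transformer architecture, as stated in the abstract
enc-depth: 6               # 6 encoder layers
dec-depth: 6               # 6 decoder layers
transformer-heads: 8       # 8 attention heads
after-epochs: 21           # stop training after 21 epochs
train-sets:
  - corpus.de              # source side of the parallel corpus
  - corpus.en              # target side of the parallel corpus
vocabs:
  - vocab.de.spm
  - vocab.en.spm
```

Such a file would be passed to Marian as `marian --config config.de-en.yml`; the same options can equally be given directly on the command line.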
Concerning the Bifixer and Bicleaner tools, we meticulously followed the steps described in the following article: "Ramírez-Sánchez, G., Zaragoza-Bernabeu, J., Bañón, M., & Rojas, S.O. (2020). Bifixer and Bicleaner: two open-source tools to clean your parallel data. EAMT.", as well as in the official GitHub pages: https://github.com/bitextor/bifixer and https://github.com/bitextor/bicleaner.
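A cleaning pipeline of the kind described, applied to a tab-separated bitext, might look roughly as follows. This is a sketch based on the tools' documented command-line usage, not the authors' exact invocation: the file names, the trained model path, and the 0.5 score threshold are assumptions for illustration.

```
# Step 1: fix orthography/encoding issues in the raw bitext (Bifixer)
python bifixer.py corpus.de-en.tsv corpus.fixed.tsv de en

# Step 2: score each sentence pair with a trained Bicleaner model
# (metadata.yaml points to the classifier trained with the probabilistic dictionaries)
bicleaner-classify corpus.fixed.tsv corpus.scored.tsv de-en/metadata.yaml

# Step 3: keep only pairs whose appended score exceeds a chosen threshold
# (0.5 here is an illustrative value, not one taken from the paper)
awk -F'\t' '$NF >= 0.5' corpus.scored.tsv > corpus.clean.tsv
```

The retained pairs in `corpus.clean.tsv` would then serve as training data for the cleaned-corpus NMT models.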
Anthology ID:
2022.wmt-1.27
Volume:
Proceedings of the Seventh Conference on Machine Translation (WMT)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
335–341
URL:
https://aclanthology.org/2022.wmt-1.27
Cite (ACL):
Marilena Malli and George Tambouratzis. 2022. Evaluating Corpus Cleanup Methods in the WMT’22 News Translation Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 335–341, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Evaluating Corpus Cleanup Methods in the WMT’22 News Translation Task (Malli & Tambouratzis, WMT 2022)
PDF:
https://preview.aclanthology.org/nodalida-main-page/2022.wmt-1.27.pdf