NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task

Rui Wang, Benjamin Marie, Masao Utiyama, Eiichiro Sumita


Abstract
This paper presents NICT’s participation in the WMT18 shared parallel corpus filtering task. The organizers provided a 1-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled subsets of 100 million and 10 million words and built the corresponding NMT systems. Empirical results show that our NMT systems trained on the sampled data achieve promising performance.
Anthology ID:
W18-6489
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
963–967
URL:
https://aclanthology.org/W18-6489
DOI:
10.18653/v1/W18-6489
Cite (ACL):
Rui Wang, Benjamin Marie, Masao Utiyama, and Eiichiro Sumita. 2018. NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 963–967, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task (Wang et al., WMT 2018)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/W18-6489.pdf