Alibaba Submission to the WMT18 Parallel Corpus Filtering Task

Jun Lu, Xiaoyu Lv, Yangbin Shi, Boxing Chen


Abstract
This paper describes the Alibaba Machine Translation Group submissions to the WMT 2018 Shared Task on Parallel Corpus Filtering. To evaluate the quality of the parallel corpus, three of its characteristics are investigated: 1) the bilingual/translation quality, 2) the monolingual quality, and 3) the corpus diversity. Both rule-based and model-based methods are adapted to score the parallel sentence pairs. The final parallel corpus filtering system is reliable, easy to build, and easy to adapt to other language pairs.
Anthology ID:
W18-6482
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
917–922
URL:
https://aclanthology.org/W18-6482
DOI:
10.18653/v1/W18-6482
Cite (ACL):
Jun Lu, Xiaoyu Lv, Yangbin Shi, and Boxing Chen. 2018. Alibaba Submission to the WMT18 Parallel Corpus Filtering Task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 917–922, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
Alibaba Submission to the WMT18 Parallel Corpus Filtering Task (Lu et al., WMT 2018)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/W18-6482.pdf