Tilde’s Parallel Corpus Filtering Methods for WMT 2018

Mārcis Pinnis


Abstract
The paper describes parallel corpus filtering methods that reduce the noise in noisy “parallel” corpora from a level where the corpora are not usable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality, scoring well below 10 BLEU points) to a level where the trained systems show decent translation quality (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde’s submissions to the WMT 2018 shared task on parallel corpus filtering.
Anthology ID:
W18-6486
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
939–945
URL:
https://aclanthology.org/W18-6486
DOI:
10.18653/v1/W18-6486
Cite (ACL):
Mārcis Pinnis. 2018. Tilde’s Parallel Corpus Filtering Methods for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 939–945, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
Tilde’s Parallel Corpus Filtering Methods for WMT 2018 (Pinnis, WMT 2018)
PDF:
https://preview.aclanthology.org/landing_page/W18-6486.pdf