Tilde’s Parallel Corpus Filtering Methods for WMT 2018

Mārcis Pinnis


Abstract
The paper describes parallel corpus filtering methods that reduce the noise in noisy “parallel” corpora from a level at which the corpora are unusable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality, scoring well below 10 BLEU points) to a level at which the trained systems achieve decent translation quality (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde’s submissions to the WMT 2018 shared task on parallel corpus filtering.
Anthology ID:
W18-6486
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Venues:
EMNLP | WMT | WS
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
939–945
URL:
https://aclanthology.org/W18-6486
DOI:
10.18653/v1/W18-6486
Cite (ACL):
Mārcis Pinnis. 2018. Tilde’s Parallel Corpus Filtering Methods for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 939–945, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
Tilde’s Parallel Corpus Filtering Methods for WMT 2018 (Pinnis, 2018)
PDF:
https://preview.aclanthology.org/update-css-js/W18-6486.pdf