Abstract
The paper describes parallel corpus filtering methods that reduce the noise in noisy “parallel” corpora from a level where the corpora are unusable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality, scoring well below 10 BLEU points) to a level where the trained systems show decent translation quality (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde’s submissions to the WMT 2018 shared task on parallel corpus filtering.
- Anthology ID:
- W18-6486
- Volume:
- Proceedings of the Third Conference on Machine Translation: Shared Task Papers
- Month:
- October
- Year:
- 2018
- Address:
- Belgium, Brussels
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 939–945
- URL:
- https://aclanthology.org/W18-6486
- DOI:
- 10.18653/v1/W18-6486
- Cite (ACL):
- Mārcis Pinnis. 2018. Tilde’s Parallel Corpus Filtering Methods for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 939–945, Belgium, Brussels. Association for Computational Linguistics.
- Cite (Informal):
- Tilde’s Parallel Corpus Filtering Methods for WMT 2018 (Pinnis, WMT 2018)
- PDF:
- https://preview.aclanthology.org/author-url/W18-6486.pdf