Data Filtering using Cross-Lingual Word Embeddings
Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, Hermann Ney
Abstract
Data filtering for machine translation (MT) describes the task of selecting a subset of a given, possibly noisy corpus with the aim to maximize the performance of an MT system trained on this selected data. Over the years, many different filtering approaches have been proposed. However, varying task definitions and data conditions make it difficult to draw a meaningful comparison. In the present work, we aim for a more systematic approach to the task at hand. First, we analyze the performance of language identification, a tool commonly used for data filtering in the MT community and identify specific weaknesses. Based on our findings, we then propose several novel methods for data filtering, based on cross-lingual word embeddings. We compare our approaches to one of the winning methods from the WMT 2018 shared task on parallel corpus filtering on three real-life, high resource MT tasks. We find that said method, which was performing very strong in the WMT shared task, does not perform well within our more realistic task conditions. While we find that our approaches come out at the top on all three tasks, different variants perform best on different tasks. Further experiments on the WMT 2020 shared task for parallel corpus filtering show that our methods achieve comparable results to the strongest submissions of this campaign.- Anthology ID:
- 2021.naacl-main.15
- Volume:
- Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- June
- Year:
- 2021
- Address:
- Online
- Editors:
- Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
- Venue:
- NAACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 162–172
- Language:
- URL:
- https://aclanthology.org/2021.naacl-main.15
- DOI:
- 10.18653/v1/2021.naacl-main.15
- Cite (ACL):
- Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, and Hermann Ney. 2021. Data Filtering using Cross-Lingual Word Embeddings. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–172, Online. Association for Computational Linguistics.
- Cite (Informal):
- Data Filtering using Cross-Lingual Word Embeddings (Herold et al., NAACL 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2021.naacl-main.15.pdf