Abstract
We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. In spite of its simplicity, we approach the strong baseline system in the downstream machine translation evaluation.- Anthology ID:
- W19-5438
- Volume:
- Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, Karin Verspoor
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 277–281
- Language:
- URL:
- https://aclanthology.org/W19-5438
- DOI:
- 10.18653/v1/W19-5438
- Cite (ACL):
- Murathan Kurfalı and Robert Östling. 2019. Noisy Parallel Corpus Filtering through Projected Word Embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 277–281, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Noisy Parallel Corpus Filtering through Projected Word Embeddings (Kurfalı & Östling, WMT 2019)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-1/W19-5438.pdf