Noisy Parallel Corpus Filtering through Projected Word Embeddings

Murathan Kurfalı, Robert Östling


Abstract
We present a very simple method for cleaning parallel text in low-resource languages, based on projecting word embeddings trained on large monolingual corpora in high-resource languages. Despite its simplicity, the method approaches the strong baseline system in the downstream machine translation evaluation.
Anthology ID:
W19-5438
Volume:
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
277–281
URL:
https://aclanthology.org/W19-5438
DOI:
10.18653/v1/W19-5438
Cite (ACL):
Murathan Kurfalı and Robert Östling. 2019. Noisy Parallel Corpus Filtering through Projected Word Embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 277–281, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Noisy Parallel Corpus Filtering through Projected Word Embeddings (Kurfalı & Östling, WMT 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/W19-5438.pdf