Noisy Parallel Corpus Filtering through Projected Word Embeddings

Murathan Kurfalı, Robert Östling


Abstract
We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. In spite of its simplicity, we approach the strong baseline system in the downstream machine translation evaluation.
Anthology ID:
W19-5438
Volume:
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Month:
August
Year:
2019
Address:
Florence, Italy
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
277–281
URL:
https://aclanthology.org/W19-5438
DOI:
10.18653/v1/W19-5438
Cite (ACL):
Murathan Kurfalı and Robert Östling. 2019. Noisy Parallel Corpus Filtering through Projected Word Embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 277–281, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Noisy Parallel Corpus Filtering through Projected Word Embeddings (Kurfalı & Östling, WMT 2019)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W19-5438.pdf