An Unsupervised System for Parallel Corpus Filtering

Viktor Hangya, Alexander Fraser


Abstract
In this paper, we describe LMU Munich’s submission for the WMT 2018 Parallel Corpus Filtering shared task, which addresses the problem of cleaning noisy parallel corpora. The task of mining and cleaning parallel sentences is important for improving the quality of machine translation systems, especially for low-resource languages. We tackle this problem in a fully unsupervised fashion, relying on bilingual word embeddings created without any bilingual signal. After pre-filtering noisy data, we rank sentence pairs by calculating bilingual sentence-level similarities and then remove redundant data by employing monolingual similarity as well. Our unsupervised system achieved good performance during the official evaluation of the shared task, scoring only a few BLEU points behind the best systems, while not requiring any parallel training data.
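The three stages named in the abstract (pre-filtering, bilingual ranking, monolingual redundancy removal) can be illustrated with a minimal sketch. The abstract does not say how each stage is implemented, so everything here is an assumption made for illustration rather than the authors' method: sentence vectors are plain averages of word vectors, similarity is cosine, the pre-filter is a simple length-ratio check, and the names `sentence_vector`, `cosine`, `filter_pairs`, `max_len_ratio`, and `dedup_threshold` are hypothetical. The `embeddings` dictionary stands in for bilingual word embeddings that map both languages into one shared space.

```python
import numpy as np


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens (assumed representation)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def filter_pairs(pairs, embeddings, dim, max_len_ratio=2.0, dedup_threshold=0.95):
    """Return sentence pairs ranked by similarity, with near-duplicates removed."""
    # Stage 1 (assumed heuristic): drop empty sides and extreme length ratios.
    kept = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if not s or not t:
            continue
        if max(len(s), len(t)) / min(len(s), len(t)) > max_len_ratio:
            continue
        kept.append((src, tgt,
                     sentence_vector(s, embeddings, dim),
                     sentence_vector(t, embeddings, dim)))

    # Stage 2: rank by bilingual similarity of source and target vectors.
    ranked = sorted(kept, key=lambda p: cosine(p[2], p[3]), reverse=True)

    # Stage 3: skip a pair whose source side is nearly identical
    # (monolingual similarity) to a better-ranked, already selected pair.
    selected = []
    for src, tgt, sv, _ in ranked:
        if any(cosine(sv, prev) >= dedup_threshold for _, _, prev in selected):
            continue
        selected.append((src, tgt, sv))
    return [(src, tgt) for src, tgt, _ in selected]
```

The redundancy check compares every candidate against all selected pairs, which is quadratic in the corpus size; that keeps the sketch short but would need an approximate nearest-neighbour index or similar at shared-task scale.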
Anthology ID:
W18-6477
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
882–887
URL:
https://aclanthology.org/W18-6477
DOI:
10.18653/v1/W18-6477
Cite (ACL):
Viktor Hangya and Alexander Fraser. 2018. An Unsupervised System for Parallel Corpus Filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 882–887, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
An Unsupervised System for Parallel Corpus Filtering (Hangya & Fraser, WMT 2018)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/W18-6477.pdf