Abstract
In this document we describe our submission to the parallel corpus filtering task, which uses multilingual word embeddings, language models, and an ensemble of pre- and post-filtering rules. We use the norms of the embeddings and the perplexities of the language models, together with the pre- and post-filtering rules, to complement the LASER baseline scores, and we obtain an improvement on the dev set for both language pairs.
- Anthology ID:
- 2020.wmt-1.108
- Volume:
- Proceedings of the Fifth Conference on Machine Translation
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 959–965
- URL:
- https://preview.aclanthology.org/add_missing_videos/2020.wmt-1.108/
- Cite (ACL):
- Ankur Kejriwal and Philipp Koehn. 2020. An exploratory approach to the Parallel Corpus Filtering shared task WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 959–965, Online. Association for Computational Linguistics.
- Cite (Informal):
- An exploratory approach to the Parallel Corpus Filtering shared task WMT20 (Kejriwal & Koehn, WMT 2020)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2020.wmt-1.108.pdf