The JHU Parallel Corpus Filtering Systems for WMT 2018

Huda Khayrallah, Hainan Xu, Philipp Koehn


Abstract
This work describes our submission to the WMT18 Parallel Corpus Filtering shared task. We use a slightly modified version of the Zipporah corpus filtering toolkit (Xu and Koehn, 2017), which computes an adequacy score and a fluency score for each sentence pair, and we use a weighted sum of the two scores as the selection criterion. Our system differs from standard Zipporah in that we experiment with computing the combination weights from the noisy corpus to be filtered, and thus avoid generating synthetic data.
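The selection criterion described in the abstract can be illustrated with a minimal sketch. It assumes each sentence pair already has an adequacy score and a fluency score, and that a higher combined score means a better pair. The function names, the weight values, and the example scores are hypothetical and are not taken from Zipporah or from the submitted systems; how the weights are actually estimated is not shown here.

```python
from typing import List, Tuple

# (pair_id, adequacy_score, fluency_score); all values are illustrative only.
ScoredPair = Tuple[str, float, float]


def combined_score(adequacy: float, fluency: float,
                   w_adequacy: float, w_fluency: float) -> float:
    """Weighted sum of the two per-pair scores (weights are hypothetical)."""
    return w_adequacy * adequacy + w_fluency * fluency


def rank_pairs(pairs: List[ScoredPair],
               w_adequacy: float, w_fluency: float) -> List[str]:
    """Return pair ids sorted from best to worst combined score.

    Assumption: higher combined score is better.
    """
    ranked = sorted(
        pairs,
        key=lambda p: combined_score(p[1], p[2], w_adequacy, w_fluency),
        reverse=True,
    )
    return [pair_id for pair_id, _, _ in ranked]


example_pairs: List[ScoredPair] = [
    ("a", 0.2, 0.9),
    ("b", 0.8, 0.7),
    ("c", 0.5, 0.1),
]
ranking = rank_pairs(example_pairs, w_adequacy=0.5, w_fluency=0.5)
```

A filtered corpus would then be formed by keeping pairs from the top of the ranking until a size budget is reached.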
Anthology ID:
W18-6479
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
896–899
URL:
https://aclanthology.org/W18-6479
DOI:
10.18653/v1/W18-6479
Cite (ACL):
Huda Khayrallah, Hainan Xu, and Philipp Koehn. 2018. The JHU Parallel Corpus Filtering Systems for WMT 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 896–899, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
The JHU Parallel Corpus Filtering Systems for WMT 2018 (Khayrallah et al., WMT 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W18-6479.pdf