Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task

Grant Erdmann, Jeremy Gwinnup


Abstract
The WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task aims to test various methods of filtering noisy parallel corpora, to make them useful for training machine translation systems. This year the noisy corpora are the relatively low-resource language pairs of Nepali-English and Sinhala-English. This paper describes the Air Force Research Laboratory (AFRL) submissions, including preprocessing methods and scoring metrics. Numerical results indicate a benefit over the baseline and show the relative benefits of different options.
Anthology ID:
W19-5436
Volume:
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Month:
August
Year:
2019
Address:
Florence, Italy
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
267–270
URL:
https://aclanthology.org/W19-5436
DOI:
10.18653/v1/W19-5436
Cite (ACL):
Grant Erdmann and Jeremy Gwinnup. 2019. Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 267–270, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Quality and Coverage: The AFRL Submission to the WMT19 Parallel Corpus Filtering for Low-Resource Conditions Task (Erdmann & Gwinnup, WMT 2019)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/W19-5436.pdf