Filtering Noisy Parallel Corpus using Transformers with Proxy Task Learning

Haluk Açarçiçek, Talha Çolakoğlu, Pınar Ece Aktan Hatipoğlu, Chong Hsuan Huang, Wei Peng


Abstract
This paper describes Huawei's submission to the WMT20 low-resource parallel corpus filtering shared task. Our approach focuses on developing a proxy task learner on top of a transformer-based multilingual pre-trained language model to boost the filtering capability for noisy parallel corpora. Such a supervised task also helps us iterate much more quickly than using an existing neural machine translation system to perform the same task. After performing empirical analyses of the fine-tuning task, we benchmark our approach by comparing the results with past years' state-of-the-art records. The paper wraps up with a discussion of limitations and future work. The scripts for this study will be made publicly available.
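The proxy task described in the abstract amounts to fine-tuning a multilingual pre-trained transformer as a binary classifier over sentence pairs and using its output probability as a filtering score. The sketch below illustrates that idea; the model choice (xlm-roberta-base), the threshold, and the HuggingFace-based scoring function are illustrative assumptions, not the authors' exact setup (see the linked code repository for that).

# Minimal sketch of proxy-task filtering, assuming a HuggingFace setup.
# In practice the classification head must first be fine-tuned on positive
# (clean parallel) pairs and negative (synthetically noised) pairs; an
# untrained head would score at random.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-base"  # assumed; the paper only says "multilingual pre-trained LM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def parallel_score(src: str, tgt: str) -> float:
    """Probability that (src, tgt) is a genuine translation pair."""
    inputs = tokenizer(src, tgt, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Filtering: keep only pairs whose score clears a threshold (0.5 is arbitrary here).
pairs = [("Das ist ein Test.", "This is a test."),
         ("Guten Morgen!", "Completely unrelated sentence.")]
kept = [(s, t) for s, t in pairs if parallel_score(s, t) > 0.5]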
Anthology ID:
2020.wmt-1.105
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
940–946
URL:
https://aclanthology.org/2020.wmt-1.105
Cite (ACL):
Haluk Açarçiçek, Talha Çolakoğlu, Pınar Ece Aktan Hatipoğlu, Chong Hsuan Huang, and Wei Peng. 2020. Filtering Noisy Parallel Corpus using Transformers with Proxy Task Learning. In Proceedings of the Fifth Conference on Machine Translation, pages 940–946, Online. Association for Computational Linguistics.
Cite (Informal):
Filtering Noisy Parallel Corpus using Transformers with Proxy Task Learning (Açarçiçek et al., WMT 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.wmt-1.105.pdf
Video:
https://slideslive.com/38939606
Code:
wpti/proxy-filter