Volctrans Parallel Corpus Filtering System for WMT 2020

Runxin Xu, Zhuo Zhi, Jun Cao, Mingxuan Wang, Lei Li


Abstract
In this paper, we describe our submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions. The task requires participants to align potential parallel sentence pairs from the given document pairs and to score them so that low-quality pairs can be filtered out. Our system, Volctrans, consists of two modules: a mining module and a scoring module. Based on a word alignment model, the mining module adopts an iterative mining strategy to extract latent parallel sentences. In the scoring module, an XLM-based scorer provides scores, followed by reranking mechanisms and ensembling. Our submissions outperform the baseline by 3.x/2.x and 2.x/2.x for km-en and ps-en under the From Scratch/Fine-Tune conditions.
Anthology ID:
2020.wmt-1.112
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
985–990
URL:
https://aclanthology.org/2020.wmt-1.112
Cite (ACL):
Runxin Xu, Zhuo Zhi, Jun Cao, Mingxuan Wang, and Lei Li. 2020. Volctrans Parallel Corpus Filtering System for WMT 2020. In Proceedings of the Fifth Conference on Machine Translation, pages 985–990, Online. Association for Computational Linguistics.
Cite (Informal):
Volctrans Parallel Corpus Filtering System for WMT 2020 (Xu et al., WMT 2020)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2020.wmt-1.112.pdf
Video:
https://slideslive.com/38939544