Abstract
The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively refined statistical sentence alignments for extracting sentence pairs from document pairs and (2) a crosslingual semantic textual similarity metric based on a pretrained multilingual language model, XLM-RoBERTa, with bilingual mappings learnt from a minimal amount of clean parallel data for scoring the parallelism of the extracted sentence pairs. The translation quality of the neural machine translation systems trained and fine-tuned on the parallel data extracted by our submissions improved significantly when compared to the organizers’ LASER-based baseline, a sentence-embedding method that worked well last year. For re-aligning the sentences in the document pairs (component 1), our statistical approach has outperformed the current state-of-the-art neural approach in this low-resource context.
- Anthology ID:
- 2020.wmt-1.110
- Volume:
- Proceedings of the Fifth Conference on Machine Translation
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 972–978
- URL:
- https://aclanthology.org/2020.wmt-1.110
- Cite (ACL):
- Chi-kiu Lo and Eric Joanis. 2020. Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models. In Proceedings of the Fifth Conference on Machine Translation, pages 972–978, Online. Association for Computational Linguistics.
- Cite (Informal):
- Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models (Lo & Joanis, WMT 2020)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/2020.wmt-1.110.pdf