Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining
Francis Zheng, Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo
Abstract
This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses the mBART implementation in fairseq to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline.
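As a rough illustration of the two-stage pipeline the abstract describes, the sketch below shows how denoising pretraining followed by translation finetuning can be run with fairseq's mBART tooling (the `multilingual_denoising` task for stage one, `translation_from_pretrained_bart` for stage two). The data directories, language codes, checkpoint paths, and hyperparameter values are illustrative assumptions, not the configuration reported in the paper.

```bash
# Hypothetical language list; the paper covers 10 indigenous target
# languages, abbreviated here to Spanish plus two examples.
LANGS=es_XX,ay_XX,gn_XX

# Stage 1: mBART-style denoising pretraining on binarized monolingual data.
# Masking/span settings follow mBART-style defaults; values are assumptions.
fairseq-train data-bin/mono \
  --task multilingual_denoising --arch mbart_base \
  --langs $LANGS --add-lang-token \
  --mask 0.3 --mask-length span-poisson --poisson-lambda 3.5 \
  --criterion cross_entropy \
  --optimizer adam --lr 3e-4 \
  --lr-scheduler polynomial_decay --total-num-update 500000 --warmup-updates 10000 \
  --max-tokens 2048 --update-freq 2 \
  --save-dir checkpoints/pretrain

# Stage 2: finetune the pretrained checkpoint on a parallel corpus
# (es -> gn, i.e., Spanish -> Guarani, as an example pair).
fairseq-train data-bin/es-gn \
  --task translation_from_pretrained_bart --arch mbart_base \
  --langs $LANGS --source-lang es_XX --target-lang gn_XX \
  --restore-file checkpoints/pretrain/checkpoint_best.pt \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 \
  --lr 3e-5 --lr-scheduler polynomial_decay \
  --warmup-updates 2500 --total-num-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 \
  --max-tokens 1024 --update-freq 2 \
  --save-dir checkpoints/finetune
```

The point of the split is that stage one needs only monolingual text, which is plentiful for high-resource languages, so the encoder-decoder learns cross-lingual representations that the scarce parallel data can then exploit in stage two.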
- Anthology ID: 2021.americasnlp-1.26
- Original: 2021.americasnlp-1.26v1
- Version 2: 2021.americasnlp-1.26v2
- Volume: Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
- Month: June
- Year: 2021
- Address: Online
- Editors: Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, Katharina Kann
- Venue: AmericasNLP
- Publisher: Association for Computational Linguistics
- Pages: 234–240
- URL: https://aclanthology.org/2021.americasnlp-1.26
- DOI: 10.18653/v1/2021.americasnlp-1.26
- Cite (ACL): Francis Zheng, Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 234–240, Online. Association for Computational Linguistics.
- Cite (Informal): Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining (Zheng et al., AmericasNLP 2021)
- PDF: https://preview.aclanthology.org/ingest-2024-clasp/2021.americasnlp-1.26.pdf