mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel


Abstract
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
Anthology ID:
2021.naacl-main.41
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
483–498
URL:
https://aclanthology.org/2021.naacl-main.41
DOI:
10.18653/v1/2021.naacl-main.41
Cite (ACL):
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Cite (Informal):
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer (Xue et al., NAACL 2021)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2021.naacl-main.41.pdf
Video:
https://preview.aclanthology.org/ingestion-script-update/2021.naacl-main.41.mp4
Code
google-research/multilingual-t5
Data
mC4, C4, DaNetQA, LiDiRus, MLQA, MuSeRC, PARus, PAWS-X, RCB, RWSD, RuCoS, SQuAD, TERRa, XQuAD, XTREME