mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel
Abstract
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.
- Anthology ID: 2021.naacl-main.41
- Volume: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month: June
- Year: 2021
- Address: Online
- Venue: NAACL
- Publisher: Association for Computational Linguistics
- Pages: 483–498
- URL: https://aclanthology.org/2021.naacl-main.41
- DOI: 10.18653/v1/2021.naacl-main.41
- Cite (ACL): Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
- Cite (Informal): mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer (Xue et al., NAACL 2021)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2021.naacl-main.41.pdf
- Code: google-research/multilingual-t5 + additional community code
- Data: mC4, C4, DaNetQA, LiDiRus, MLQA, MuSeRC, PARus, PAWS-X, RCB, RWSD, RuCoS, SQuAD, TERRa, XQuAD, XTREME
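The abstract notes that the mT5 checkpoints are publicly available and that every task is handled in a unified text-to-text format. As a minimal sketch of what that interface looks like in practice, assuming the community-released mT5 checkpoints on the Hugging Face hub (e.g. `google/mt5-small`) and the `transformers` library rather than the official google-research/multilingual-t5 codebase linked above:

```python
# Minimal sketch of mT5's text-to-text interface. Assumes the
# "google/mt5-small" checkpoint on the Hugging Face hub and the
# `transformers` library; this is not part of the official
# google-research/multilingual-t5 repository linked above.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Every task is cast as string-in, string-out. Note that the released
# checkpoint is only pre-trained with the span-corruption objective, so
# sensible outputs require fine-tuning on a downstream task (e.g. one of
# the datasets listed under "Data") first.
inputs = tokenizer(
    "question: Where is the Eiffel Tower? "
    "context: The Eiffel Tower is in Paris.",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```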