Abstract
When we transfer a pretrained language model to a new language, there are many axes of variation that change at once. To disentangle the impact of different factors like syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of cross-lingual variation at a time, and then measure the resulting drops in a pretrained model’s downstream performance. We find that models can largely recover from syntactic-style shifts, but cannot recover from vocabulary misalignment and embedding matrix re-initialization, even with continued pretraining on 15 million tokens. Moreover, good-quality tokenizers in the transfer language do not make vocabulary alignment easier. Our experiments provide insights into the factors of cross-lingual transfer that researchers should most focus on when designing language transfer scenarios.
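As an illustrative sketch (not the paper's released code), the snippet below shows how two of the axes named in the abstract might be simulated on a pretrained HuggingFace model: vocabulary misalignment via a fixed random permutation of token ids, and embedding matrix re-initialization. The model name (`roberta-base`), the permutation seed, and the helper function names are assumptions made for illustration.

```python
# Hypothetical sketch of two controlled transformations discussed in the abstract:
# (1) vocabulary misalignment, simulated by remapping token ids with a fixed
#     random permutation, and (2) embedding matrix re-initialization.
# This is not the authors' pipeline; model name and seed are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

vocab_size = model.config.vocab_size
generator = torch.Generator().manual_seed(0)
permutation = torch.randperm(vocab_size, generator=generator)  # fixed id shuffle

def misalign_vocabulary(text: str) -> torch.Tensor:
    """Tokenize text, then scramble token ids to break the pretrained
    token-embedding alignment while leaving sentence structure intact.
    (A faithful setup would keep special tokens like <s>/</s> fixed.)"""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    return permutation[ids]

def reinitialize_embeddings(model) -> None:
    """Replace the pretrained input embeddings with a freshly initialized
    matrix of the same shape, simulating transfer to a new vocabulary."""
    old = model.get_input_embeddings()
    new = torch.nn.Embedding(old.num_embeddings, old.embedding_dim)
    new.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
    model.set_input_embeddings(new)

scrambled_ids = misalign_vocabulary("The cat sat on the mat.")
reinitialize_embeddings(model)
```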
- Anthology ID: 2023.emnlp-main.198
- Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 3280–3289
- URL: https://aclanthology.org/2023.emnlp-main.198
- DOI: 10.18653/v1/2023.emnlp-main.198
- Cite (ACL): Zhengxuan Wu, Alex Tamkin, and Isabel Papadimitriou. 2023. Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3280–3289, Singapore. Association for Computational Linguistics.
- Cite (Informal): Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies (Wu et al., EMNLP 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-5/2023.emnlp-main.198.pdf