Unknown Script: Impact of Script on Cross-Lingual Transfer

Wondimagegnhue Tufa, Ilia Markov, Piek Vossen


Abstract
Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on cross-lingual transfer performance. We consider the case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models pre-trained with different tokenization methods to determine the factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal that the tokenizer is a stronger factor than shared script, language similarity, and model size.
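The abstract's central claim is that the tokenizer, more than shared script, language similarity, or model size, determines how well a model transfers to a language whose script it has never seen. A minimal sketch of how this effect can be probed is below, assuming the HuggingFace transformers library; the model names and the Ge'ez-script example string are illustrative choices, not taken from the paper.

```python
# Sketch: compare how tokenizers trained without a given script fragment its text.
# Heavy fragmentation or a high unknown-token share is the kind of tokenizer effect
# the paper points to. Model names and the sample sentence are illustrative only.
from transformers import AutoTokenizer

# Sample text in the Ge'ez script (hypothetical example, not from the paper).
text = "ሰላም ከመይ ኣለኻ"

for name in ["bert-base-cased", "xlm-roberta-base"]:  # one monolingual, one multilingual
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(text)
    unk_share = sum(p == tok.unk_token for p in pieces) / max(len(pieces), 1)
    print(f"{name}: {len(pieces)} pieces, {unk_share:.0%} unknown -> {pieces}")
```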
Anthology ID:
2024.naacl-srw.14
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yang (Trista) Cao, Isabel Papadimitriou, Anaelia Ovalle, Marcos Zampieri, Francis Ferraro, Swabha Swayamdipta
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
124–129
URL:
https://aclanthology.org/2024.naacl-srw.14
DOI:
10.18653/v1/2024.naacl-srw.14
Cite (ACL):
Wondimagegnhue Tufa, Ilia Markov, and Piek Vossen. 2024. Unknown Script: Impact of Script on Cross-Lingual Transfer. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 124–129, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Unknown Script: Impact of Script on Cross-Lingual Transfer (Tufa et al., NAACL 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.naacl-srw.14.pdf