2024
Super donors and super recipients: Studying cross-lingual transfer between high-resource and low-resource languages
Vitaly Protasov | Elisei Stakovskii | Ekaterina Voloshina | Tatiana Shavrina | Alexander Panchenko
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
Despite the increasing popularity of multilingualism within the NLP community, numerous languages continue to be underrepresented due to the lack of available resources. Our work addresses this gap by introducing experiments on cross-lingual transfer between 158 high-resource (HR) and 31 low-resource (LR) languages. We mainly focus on extremely LR languages, some of which are first presented in research works. Across 158 × 31 HR–LR language pairs, we investigate how continued pretraining on different HR languages affects the mT5 model’s performance in representing LR languages in the LM setup. Our findings surprisingly reveal that the optimal language pairs with improved performance do not necessarily align with direct linguistic motivations, with subtoken overlap playing a more crucial role. Our investigation indicates that specific languages tend to be almost universally beneficial for pretraining (super donors), while others benefit from pretraining with almost any language (super recipients). This pattern recurs in various setups and is unrelated to the linguistic similarity of HR–LR pairs. Furthermore, we perform evaluation on two downstream tasks, part-of-speech (POS) tagging and machine translation (MT), showing how HR pretraining affects LR language performance.
2022
Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation
Oleg Serikov | Vitaly Protasov | Ekaterina Voloshina | Viktoria Knyazkova | Tatiana Shavrina
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Linguistic analysis of language models is one of the ways to explain and describe their reasoning, weaknesses, and limitations. In the probing part of the model interpretability research, studies concern individual languages as well as individual linguistic structures. The question arises: are the detected regularities linguistically coherent, or on the contrary, do they dissonate at the typological scale? Moreover, the majority of studies address the inherent set of languages and linguistic structures, leaving the actual typological diversity knowledge out of scope. In this paper, we present and apply a GUI-assisted framework allowing us to easily probe massive amounts of languages for all the morphosyntactic features present in the Universal Dependencies data. We show that, reflecting the Anglo-centric trend in NLP over the past years, most of the regularities revealed in the mBERT model are typical for the Western European languages. Our framework can be integrated with the existing probing toolboxes, model cards, and leaderboards, allowing practitioners to use and share their familiar probing methods to interpret multilingual models. Thus, we propose a toolkit to systematize the multilingual flaws in multilingual models, providing a reproducible experimental setup for 104 languages and 80 morphosyntactic features.
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP
Tatiana Shavrina | Vladislav Mikhailov | Valentin Malykh | Ekaterina Artemova | Oleg Serikov | Vitaly Protasov
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP