Olga Pelloni


TeDDi Sample: Text Data Diversity Sample for Language Comparison and Multilingual NLP
Steven Moran | Christian Bentz | Ximena Gutierrez-Vasques | Olga Pelloni | Tanja Samardzic
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the TeDDi sample, a diversity sample of text data for language comparison and multilingual Natural Language Processing. The TeDDi sample currently features 89 languages based on the typological diversity sample in the World Atlas of Language Structures. It consists of more than 20k texts and is accompanied by open-source corpus processing tools. The aim of TeDDi is to facilitate text-based quantitative analysis of linguistic diversity. We describe in detail the TeDDi sample, how it was created, data availability, and its added value for NLP and linguistic research.

On Language Spaces, Scales and Cross-Lingual Transfer of UD Parsers
Tanja Samardžić | Ximena Gutierrez-Vasques | Rob van der Goot | Max Müller-Eberstein | Olga Pelloni | Barbara Plank
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Cross-lingual transfer of parsing models has been shown to work well for several closely-related languages, but predicting the success in other cases remains hard. Our study is a comprehensive analysis of the impact of linguistic distance on the transfer of UD parsers. As an alternative to syntactic typological distances extracted from URIEL, we propose three text-based feature spaces and show that they can be more precise predictors, especially on a more local scale, when only shorter distances are taken into account. Our analyses also reveal that the good coverage in typological databases is not among the factors that explain good transfer.

Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages
Olga Pelloni | Anastassia Shaitarova | Tanja Samardzic
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve the performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice for a transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but it has been revealed recently that this is often not the best choice. The success of cross-lingual transfer seems to depend on some properties of languages, which are currently hard to explain. Successful transfer often happens between unrelated languages and it often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models over the languages that score low on our measure results in the lowest average perplexity over target low-resource languages. Our correlation coefficients obtained with three different pre-trained multilingual models are consistently higher than all the other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choice (genealogical and typological proximity).