2025
On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility
Kushal Tatariya | Wessel Poelman | Miryam de Lhoneux
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Language model architectures are predominantly created first for English and afterwards applied to other languages. This can lead to problems for languages that are structurally different from English. We study one specific architectural choice: positional encodings. We do this through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis states that there is a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice versa. Positional encodings are a direct target for investigating the implications of this hypothesis for language modelling. We pre-train three monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks. We fail to find a consistent trend across various proxies for morphological complexity and word order flexibility. Our work shows that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.
2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Kushal Tatariya | Vladimir Araujo | Thomas Bauwens | Miryam de Lhoneux
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the model’s visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of pixel-based language models.
Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification
Kushal Tatariya | Heather Lent | Johannes Bjerva | Miryam de Lhoneux
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Emotion classification is a challenging task in NLP due to the inherently idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolinguistic studies have shown that Hinglish speakers switch to Hindi when expressing negative emotions and to English when expressing positive emotions. To understand whether language models can learn these associations, we study the effect of language on emotion prediction across three PLMs on a Hinglish emotion classification dataset. Using LIME and token-level language ID, we find that the models do learn these associations between language choice and emotional expression. Moreover, the presence of code-mixed data in pre-training can augment that learning when task-specific data is scarce. We also conclude from the misclassifications that the models may overgeneralise this heuristic to other infrequent examples where this sociolinguistic phenomenon does not apply.
CreoleVal: Multilingual Multitask Benchmarks for Creoles
Heather Lent | Kushal Tatariya | Raj Dabre | Yiyi Chen | Marcell Fekete | Esther Ploeger | Li Zhou | Ruth-Ann Armstrong | Abee Eijansantos | Catriona Malau | Hans Erik Heje | Ernests Lavrinovics | Diptesh Kanojia | Paul Belony | Marcel Bollmann | Loïc Grobol | Miryam de Lhoneux | Daniel Hershcovich | Michel DeGraff | Anders Søgaard | Johannes Bjerva
Transactions of the Association for Computational Linguistics, Volume 12
Creoles represent an under-explored and marginalized group of languages, with few resources available for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered by the lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks and covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and, in general, a step towards more equitable language technology around the globe.
2023
Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter?
Kushal Tatariya | Heather Lent | Miryam de Lhoneux
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Monolinguals make up a minority of the world’s speakers, yet most language technologies lag behind in handling the linguistic behaviours of bilingual and multilingual speakers. A commonly observed phenomenon in such communities is code-mixing, which is prevalent on social media and thus requires attention in NLP research. In this work, we look into the ability of pretrained language models to handle code-mixed data, with a focus on how the languages present in pretraining affect the downstream performance of the model, measured on the task of sentiment analysis. Ultimately, we find that the pretraining language has little effect on performance when the model sees code-mixed data during downstream finetuning. We also evaluate the models on code-mixed data in a zero-shot setting, after task-specific finetuning on a monolingual dataset. We find that this brings out differences in model performance that can be attributed to the pretraining languages. We present a thorough analysis of these findings that also considers model performance based on the composition of languages in the code-mixed datasets.