Tomasz Limisiewicz


2024

pdf
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz | Terra Blevins | Hila Gonen | Orevaoghene Ahia | Luke Zettlemoyer
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts.Although contemporary text encoding methods cover most of the world’s writing systems, they exhibit bias towards the high-resource languages of the Global West. As a result, texts of underrepresented languages tend to be segmented into long sequences of linguistically meaningless units. To address the disparities, we introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages. Our encoding convention (MYTE) is based on morphemes, as their inventories are more balanced across languages than characters, which are used in previous methods. We show that MYTE produces shorter encodings for all 99 analyzed languages, with the most notable improvements for non-European languages and non-Latin scripts. This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.

pdf
Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models
Terra Blevins | Tomasz Limisiewicz | Suchin Gururangan | Margaret Li | Hila Gonen | Noah A. Smith | Luke Zettlemoyer
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all 16 considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.

pdf
Teaching LLMs at Charles University: Assignments and Activities
Jindřich Helcl | Zdeněk Kasner | Ondřej Dušek | Tomasz Limisiewicz | Dominik Macháček | Tomáš Musil | Jindřich Libovický
Proceedings of the Sixth Workshop on Teaching NLP

This paper presents teaching materials, particularly assignments and ideas for classroom activities, from a new course on large language modelsThe assignments include experiments with LLM inference for weather report generation and machine translation.The classroom activities include class quizzes, focused research on downstream tasks and datasets, and an interactive “best paper” session aimed at reading and comprehension of research papers.

2023

pdf
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
Tomasz Limisiewicz | Jiří Balhar | David Mareček
Findings of the Association for Computational Linguistics: ACL 2023

Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.Our findings show that the overlap of vocabulary across languages can be actually detrimental to certain downstream tasks (POS, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of the language-specific tokens in the multilingual vocabulary significantly impacts the word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models and guidelines for future model developers to choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.

pdf
Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
Bar Iluz | Tomasz Limisiewicz | Gabriel Stanovsky | David Mareček
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models
Tomasz Limisiewicz | Dan Malkin | Gabriel Stanovsky
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Multilingual models have been widely used for the cross-lingual transfer to low-resource languages. However, the performance on these languages is hindered by their under-representation in the pretraining data. To alleviate this problem, we propose a novel multilingual training technique based on teacher-student knowledge distillation. In this setting, we utilize monolingual teacher models optimized for their language. We use those teachers along with balanced (sub-sampled) data to distill the teachers’ knowledge into a single multilingual student. Our method outperforms standard training methods in low-resource languages and retains performance on high-resource languages while using the same amount of data. If applied widely, our approach can increase the representation of low-resource languages in NLP systems.

pdf
ÚFAL Submission for SIGTYP Supervised Cognate Detection Task
Tomasz Limisiewicz
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

In this work, I present ÚFAL submission for the supervised task of detecting cognates and derivatives. Cognates are word pairs in different languages sharing the origin in earlier attested forms in ancestral language, while derivatives come directly from another language. For the task, I developed gradient boosted tree classifier trained on linguistic and statistical features. The solution came first from two delivered systems with an 87% F1 score on the test split. This write-up gives an insight into the system and shows the importance of using linguistic features and character-level statistics for the task.

2022

pdf
Don’t Forget About Pronouns: Removing Gender Bias in Language Models Without Losing Factual Gender Information
Tomasz Limisiewicz | David Mareček
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

The representations in large language models contain multiple types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model’s embeddings and identify components encoding both types of information with probing. We aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. The findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreferences.

pdf
A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank
Dan Malkin | Tomasz Limisiewicz | Gabriel Stanovsky
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations.

2021

pdf
Introducing Orthogonal Constraint in Structural Probes
Tomasz Limisiewicz | David Mareček
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

With the recent success of pre-trained models in NLP, a significant focus was put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of word embeddings is performed in order to approximate the topology of dependency structures. In this work, we introduce a new type of structural probing, where the linear projection is decomposed into 1. iso-morphic space rotation; 2. linear scaling that identifies and scales the most relevant dimensions. In addition to syntactic dependency, we evaluate our method on two novel tasks (lexical hypernymy and position in a sentence). We jointly train the probes for multiple tasks and experimentally show that lexical and syntactic information is separated in the representations. Moreover, the orthogonal constraint makes the Structural Probes less vulnerable to memorization.

pdf
Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes
Tomasz Limisiewicz | David Mareček
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features and learn a projection based only on mono-lingual annotated datasets. We evaluate syntactic (UD) and lexical (WordNet) structural information encoded inmBERT’s contextual representations for nine diverse languages. We observe that for languages closely related to English, no transformation is needed. The evaluated information is encoded in a shared cross-lingual embedding space. For other languages, it is beneficial to apply orthogonal transformation learned separately for each language. We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.

2020

pdf
Universal Dependencies According to BERT: Both More Specific and More General
Tomasz Limisiewicz | David Mareček | Rudolf Rosa
Findings of the Association for Computational Linguistics: EMNLP 2020

This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one. We suggest a method for relation identification and syntactic tree construction. Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT. At the same time, it can be successfully applied with only a minimal amount of supervision and generalizes well across languages.

pdf
Gender Coreference and Bias Evaluation at WMT 2020
Tom Kocmi | Tomasz Limisiewicz | Gabriel Stanovsky
Proceedings of the Fifth Conference on Machine Translation

Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations. For example, always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over four diverse target languages: Czech, German, Polish, and Russian. To achieve this, we use WinoMT, a recent automatic test suite which examines gender coreference and bias when translating from English to languages with grammatical gender. We extend WinoMT to handle two new languages tested in WMT: Polish and Czech. We find that all systems consistently use spurious correlations in the data rather than meaningful contextual information.