Igor Sterner


2024

Multilingual Identification of English Code-Switching
Igor Sterner
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

This work addresses the task of identifying English code-switching in multilingual text. We train two token-level classifiers on data of high-resource language pairs. The first distinguishes between English, not English, morphologically mixed, and other words. The second is a binary classifier that identifies named entities. Results indicate that our system is on par with SoTA for high-resource language pairs. Meanwhile, we show that on low-resource language pairs not in the training data, our system outperforms SoTA by between 2.31 and 4.59% F1. We also analyse the correlation between typological similarity of the languages and difficulty in recognizing code-switching. Our system is a new strong baseline system for code-switching research between any language and English.

2023

TongueSwitcher: Fine-Grained Identification of German-English Code-Switching
Igor Sterner | Simone Teufel
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching

This paper contributes to German-English code-switching research. We provide the largest corpus of naturally occurring German-English code-switching, where English is included in German text, and two methods for code-switching identification. The first method is rule-based, using wordlists and morphological processing. We use this method to compile a corpus of 25.6M tweets employing German-English code-switching. In our second method, we continue pretraining of a neural language model on this corpus and classify tokens based on embeddings from this language model. Our systems establish SoTA on our new corpus and an existing German-English code-switching benchmark. In particular, we systematically study code-switching for language-ambiguous words which can only be resolved in context, and morphologically mixed words consisting of both English and German morphemes. We distribute both corpora and systems to the research community.