Igor Sterner
2024
Multilingual Identification of English Code-Switching
Igor Sterner
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
This work addresses the task of identifying English code-switching in multilingual text. We train two token-level classifiers on data of high-resource language pairs. The first distinguishes between English, not English, morphologically mixed, and other words. The second is a binary classifier that identifies named entities. Results indicate that our system is on par with SoTA for high-resource language pairs. Meanwhile, we show that on low-resource language pairs not in the training data, our system outperforms SoTA by between 2.31 and 4.59% F1. We also analyse the correlation between the typological similarity of the languages and the difficulty of recognizing code-switching. Our system is a strong new baseline for code-switching research between any language and English.
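As a rough illustration of how the outputs of two such token-level classifiers might be combined, the sketch below merges a four-way language label with a binary named-entity flag. The abstract does not say how the two predictions are combined, so the override rule, the label strings, and the function name `merge_labels` are all assumptions made for this example, not the paper's method.

```python
from typing import List

# Hypothetical label names for the four-way language classifier.
LANG_LABELS = {"English", "notEnglish", "Mixed", "Other"}


def merge_labels(lang_labels: List[str], is_named_entity: List[bool]) -> List[str]:
    """Combine per-token language labels with per-token named-entity flags.

    Assumption: a token flagged as a named entity is relabelled "NE",
    regardless of its language label. The source does not specify this rule.
    """
    if len(lang_labels) != len(is_named_entity):
        raise ValueError("both classifiers must label the same tokens")
    merged = []
    for label, flagged in zip(lang_labels, is_named_entity):
        if label not in LANG_LABELS:
            raise ValueError(f"unknown language label: {label}")
        merged.append("NE" if flagged else label)
    return merged
```

For example, `merge_labels(["notEnglish", "English", "Mixed"], [False, True, False])` keeps the first and third labels and replaces the second with `"NE"`.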
2023
TongueSwitcher: Fine-Grained Identification of German-English Code-Switching
Igor Sterner | Simone Teufel
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching
This paper contributes to German-English code-switching research. We provide the largest corpus of naturally occurring German-English code-switching, where English is included in German text, and two methods for code-switching identification. The first method is rule-based, using wordlists and morphological processing. We use this method to compile a corpus of 25.6M tweets employing German-English code-switching. In our second method, we continue pretraining of a neural language model on this corpus and classify tokens based on embeddings from this language model. Our systems establish SoTA on our new corpus and an existing German-English code-switching benchmark. In particular, we systematically study code-switching for language-ambiguous words which can only be resolved in context, and morphologically mixed words consisting of both English and German morphemes. We distribute both corpora and systems to the research community.
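The wordlist part of a rule-based approach like the first method can be sketched as a simple lookup. This is a minimal illustration only: the tiny wordlists, the label names, and the function `tag_tokens` are hypothetical, and the morphological processing the abstract mentions is not shown.

```python
from typing import List

# Toy wordlists for illustration only; a real system would use large lexicons.
ENGLISH_WORDS = {"the", "meeting", "was", "cool", "hand"}
GERMAN_WORDS = {"das", "war", "echt", "hand"}


def tag_tokens(tokens: List[str]) -> List[str]:
    """Label each token by wordlist membership (case-insensitive)."""
    labels = []
    for token in tokens:
        word = token.lower()
        in_english = word in ENGLISH_WORDS
        in_german = word in GERMAN_WORDS
        if in_english and in_german:
            # Appears in both lists: can only be resolved in context.
            labels.append("Ambiguous")
        elif in_english:
            labels.append("English")
        elif in_german:
            labels.append("German")
        else:
            labels.append("Unknown")
    return labels


print(tag_tokens(["Das", "Meeting", "war", "echt", "cool"]))
```

Words found in both lists are only flagged here; deciding their language would need context, which is the case the abstract says it studies systematically.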