Multilingual Identification of English Code-Switching

Igor Sterner

Abstract

This work addresses the task of identifying English code-switching in multilingual text. We train two token-level classifiers on data from high-resource language pairs. The first assigns each token one of four labels: English, not English, morphologically mixed, or other. The second is a binary classifier that identifies named entities. Results indicate that our system is on par with the state of the art (SoTA) for high-resource language pairs. On low-resource language pairs not in the training data, our system outperforms SoTA by between 2.31 and 4.59% F1. We also analyse the correlation between the typological similarity of the languages and the difficulty of recognizing code-switching. Our system provides a strong new baseline for research on code-switching between any language and English.
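The two-classifier setup described in the abstract can be illustrated with a minimal sketch of how per-token outputs might be combined. This is not the paper's implementation: the label strings, the `NamedEntity` tag, the function name `merge_predictions`, and the rule that a named-entity flag overrides the language label are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical label names; the abstract only names the four categories.
LANGUAGE_LABELS = ("English", "notEnglish", "Mixed", "Other")
NAMED_ENTITY = "NamedEntity"  # hypothetical tag for the binary classifier's positive class


def merge_predictions(
    tokens: List[str],
    language_labels: List[str],
    is_named_entity: List[bool],
) -> List[Tuple[str, str]]:
    """Combine the outputs of two token-level classifiers into one label per token.

    Assumption: a token flagged as a named entity takes the NamedEntity tag,
    regardless of the language label predicted for it.
    """
    if not (len(tokens) == len(language_labels) == len(is_named_entity)):
        raise ValueError("all inputs must have one entry per token")

    merged = []
    for token, label, is_ne in zip(tokens, language_labels, is_named_entity):
        if label not in LANGUAGE_LABELS:
            raise ValueError(f"unknown language label: {label!r}")
        merged.append((token, NAMED_ENTITY if is_ne else label))
    return merged
```

For a German sentence with one English insertion, for example, "shopping" would keep its English label, while "London" would be marked as a named entity rather than counted as a switch into English.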

Anthology ID: 2024.vardial-1.14
Volume: Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
Venues: VarDial | WS
Publisher: Association for Computational Linguistics
Pages: 163–173
URL: https://aclanthology.org/2024.vardial-1.14
Cite (ACL): Igor Sterner. 2024. Multilingual Identification of English Code-Switching. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 163–173, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Multilingual Identification of English Code-Switching (Sterner, VarDial-WS 2024)
PDF: https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.vardial-1.14.pdf
Supplementary material: 2024.vardial-1.14.SupplementaryMaterial.txt