Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data

Fahad AlGhamdi, Mona Diab


Abstract
Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages or dialects within a single conversation. Processing CS data is especially challenging when the switching is intra-sentential, since state-of-the-art monolingual NLP technologies are geared toward processing one language at a time. In this paper, we address the problem of Part-of-Speech (POS) tagging in the context of linguistic code switching. We explore multiple neural network architectures to measure the impact of different pretrained embedding methods on POS tagging of CS data. We investigate the landscape in four CS language pairs: Spanish-English, Hindi-English, Modern Standard Arabic-Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic-Levantine Arabic dialect (MSA-LEV). Our results show that multilingual embedding (e.g., MSA-EGY and MSA-LEV) helps closely related languages (EGY/LEV) but adds noise to the languages that are distant (SPA/HIN). Finally, we show that our proposed models outperform state-of-the-art CS taggers for the MSA-EGY language pair.
Anthology ID:
W19-1410
Volume:
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
June
Year:
2019
Address:
Ann Arbor, Michigan
Editors:
Marcos Zampieri, Preslav Nakov, Shervin Malmasi, Nikola Ljubešić, Jörg Tiedemann, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
99–109
URL:
https://aclanthology.org/W19-1410
DOI:
10.18653/v1/W19-1410
Cite (ACL):
Fahad AlGhamdi and Mona Diab. 2019. Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 99–109, Ann Arbor, Michigan. Association for Computational Linguistics.
Cite (Informal):
Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data (AlGhamdi & Diab, VarDial 2019)
PDF:
https://preview.aclanthology.org/ml4al-ingestion/W19-1410.pdf