Normalization and Back-Transliteration for Code-Switched Data

Dwija Parikh; Thamar Solorio

doi:10.18653/v1/2021.calcs-1.15

Abstract

Code-switching is an omnipresent phenomenon in multilingual communities all around the world but remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated into the Roman script, which prevents the use of monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research.

Anthology ID: 2021.calcs-1.15
Volume: Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Month: June
Year: 2021
Address: Online
Editors: Thamar Solorio, Shuguang Chen, Alan W. Black, Mona Diab, Sunayana Sitaram, Victor Soto, Emre Yilmaz, Anirudh Srinivasan
Venue: CALCS
Publisher: Association for Computational Linguistics
Pages: 119–124
URL: https://preview.aclanthology.org/fix-sig-urls/2021.calcs-1.15/
DOI: 10.18653/v1/2021.calcs-1.15
Cite (ACL): Dwija Parikh and Thamar Solorio. 2021. Normalization and Back-Transliteration for Code-Switched Data. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 119–124, Online. Association for Computational Linguistics.
Cite (Informal): Normalization and Back-Transliteration for Code-Switched Data (Parikh & Solorio, CALCS 2021)
PDF: https://preview.aclanthology.org/fix-sig-urls/2021.calcs-1.15.pdf
