CoLI-Kanglish: Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka model

Vajratiya Vajrobol

Abstract

Because online users come from many cultural backgrounds, they often use code-mixed language to express themselves on social media. Language support for such users depends on a system's ability to identify the constituent languages of code-mixed text. Language identification, which determines the language of individual textual entities in a code-mixed corpus, is therefore a current and relevant classification problem. Code-mixed texts are difficult to interpret and analyze algorithmically. However, transformer-based techniques can be used to analyze such texts and identify the language of each word. Kannada is a Dravidian language spoken and written in Karnataka, India. This study aims to identify the language of individual words in a corpus of code-mixed Kannada-English texts using transformer-based techniques. The proposed Distilka model was developed by fine-tuning the DistilBERT model on the code-mixed corpus. This model performed best on the official test dataset, with a macro-averaged F1-score of 0.62 and a weighted precision score of 0.86. The proposed solution ranked first in the shared task.
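The abstract reports two differently averaged metrics. The minimal sketch below illustrates how macro-averaged F1 and weighted precision are computed for word-level labels, and why the two can diverge when the label distribution is imbalanced. The labels (`kn`, `en`, `mixed`) and the toy predictions are assumptions made for illustration only; they are not the shared-task label set, data, or results.

```python
from collections import Counter


def per_class_counts(gold, pred):
    """Count true positives, false positives and false negatives per label."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    return labels, tp, fp, fn


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    labels, tp, fp, fn = per_class_counts(gold, pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def weighted_precision(gold, pred):
    """Per-class precision averaged with weights equal to gold support."""
    labels, tp, fp, _ = per_class_counts(gold, pred)
    support = Counter(gold)
    total = 0.0
    for label in labels:
        predicted = tp[label] + fp[label]
        precision = tp[label] / predicted if predicted else 0.0
        total += precision * support[label]
    return total / len(gold)


# Hypothetical word-level labels: a frequent class and two rare ones.
gold = ["kn"] * 8 + ["en", "mixed"]
pred = ["kn"] * 8 + ["kn", "en"]

print(f"macro F1:           {macro_f1(gold, pred):.3f}")
print(f"weighted precision: {weighted_precision(gold, pred):.3f}")
```

In this toy case the frequent class is predicted well while both rare classes are missed, so the weighted score stays much higher than the macro score. A gap of this kind is one possible reading of a macro F1 that sits well below a weighted precision, though the abstract itself does not give a per-class breakdown.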

Anthology ID: 2022.icon-wlli.2
Volume: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
Month: December
Year: 2022
Address: IIIT Delhi, New Delhi, India
Editors: Bharathi Raja Chakravarthi, Abirami Murugappan, Dhivya Chinnappa, Adeep Hane, Prasanna Kumar Kumeresan, Rahul Ponnusamy
Venue: ICON
Publisher: Association for Computational Linguistics
Pages: 7–11
URL: https://preview.aclanthology.org/ingest-emnlp/2022.icon-wlli.2/
Cite (ACL): Vajratiya Vajrobol. 2022. CoLI-Kanglish: Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka model. In Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, pages 7–11, IIIT Delhi, New Delhi, India. Association for Computational Linguistics.
Cite (Informal): CoLI-Kanglish: Word-Level Language Identification in Code-Mixed Kannada-English Texts Shared Task using the Distilka model (Vajrobol, ICON 2022)
PDF: https://preview.aclanthology.org/ingest-emnlp/2022.icon-wlli.2.pdf
