Abstract
Language identification in code-mixed text has recently attracted research interest because of the widespread use of social media. Multilingual speakers often mix languages when communicating with each other. Identifying the languages in such code-mixed text has become necessary for detecting hate speech, fake news, misinformation, and disinformation, and for tasks such as sentiment analysis. In this work, we propose a BERT-based approach to language identification for the CoLI-Kanglish shared task at ICON 2022. Our approach achieved a weighted-average F1 score of 86% and a macro-average F1 score of 57% on the test set.
- Anthology ID:
- 2022.icon-wlli.3
- Volume:
- Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
- Month:
- December
- Year:
- 2022
- Address:
- IIIT Delhi, New Delhi, India
- Editors:
- Bharathi Raja Chakravarthi, Abirami Murugappan, Dhivya Chinnappa, Adeep Hane, Prasanna Kumar Kumeresan, Rahul Ponnusamy
- Venue:
- ICON
- Publisher:
- Association for Computational Linguistics
- Pages:
- 12–17
- URL:
- https://aclanthology.org/2022.icon-wlli.3
- Cite (ACL):
- Pritam Deka, Nayan Jyoti Kalita, and Shikhar Kumar Sarma. 2022. BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task@ICON 2022. In Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, pages 12–17, IIIT Delhi, New Delhi, India. Association for Computational Linguistics.
- Cite (Informal):
- BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task@ICON 2022 (Deka et al., ICON 2022)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/2022.icon-wlli.3.pdf
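
The abstract reports both a weighted-average and a macro-average F1 score, and the two differ widely (86% versus 57%). A gap in that direction typically appears when frequent classes are predicted well and rare classes poorly. The sketch below illustrates how the two averages are computed, using only the Python standard library. The labels `kn` and `en` and the toy gold/predicted sequences are hypothetical and are not taken from the shared-task data or from the authors' system.

```python
from collections import Counter

def f1_per_class(gold, pred):
    """Return {label: (f1, support)} for every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    support = Counter(gold)  # number of gold items per label
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores

def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)

def weighted_f1(scores):
    """Mean of per-class F1 weighted by gold support: frequent classes dominate."""
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total

# Hypothetical, imbalanced toy data: 8 words tagged "kn", 2 tagged "en".
gold = ["kn"] * 8 + ["en"] * 2
pred = ["kn"] * 8 + ["kn", "en"]  # one of the rare "en" words is missed

scores = f1_per_class(gold, pred)
print("macro F1:   ", round(macro_f1(scores), 3))
print("weighted F1:", round(weighted_f1(scores), 3))
```

In this toy case the single error falls on the rare class, so the macro average drops more than the weighted average. The same mechanism would be consistent with the gap reported in the abstract, although the abstract itself does not state the cause.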