Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding

O. E. Ojo; A. Gelbukh; H. Calvo; A. Feldman; O. O. Adebanji; J. Armenta-Segura

Abstract

People often switch between languages in conversation and in written communication on social media platforms. The languages in such texts, known as code-mixed texts, can be mixed at the sentence, word, or even sub-word level. In this paper, we address the problem of identifying language at the word level in code-mixed texts using character sequences and word embeddings. We feed a range of character-based and word-based text features to machine learning models and deep neural networks. The dataset for this experiment was built from YouTube video comments written in code-mixed Kannada-English (Kn-En). The texts were pre-processed and split into words, and each word was labeled as ‘Kannada’, ‘English’, ‘Mixed-Language’, ‘Name’, ‘Location’, or ‘Other’. The proposed techniques learned from these features and effectively identified the language of the words in the dataset. The proposed CK-Keras model with pre-trained Word2Vec embeddings was our best-performing system, outperforming the other methods on F1 score.
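
To make the idea of character-sequence features for word-level language identification concrete, here is a minimal sketch. It is not the CK-Keras model or the Word2Vec pipeline from the paper: it swaps in a small naive Bayes classifier over character n-grams, and the helper names (`char_ngrams`, `NgramNaiveBayes`) and the tiny romanized word list are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing.

    A stand-in classifier for illustration, not the paper's model.
    """

    def fit(self, words, labels):
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter(labels)       # label -> number of words
        self.vocab = set()                        # all n-grams seen in training
        for word, label in zip(words, labels):
            grams = char_ngrams(word)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, word):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(count / total)
            denom = sum(self.ngram_counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(word):
                score += math.log((self.ngram_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (assumed for illustration, not taken from the dataset).
train_words = ["the", "video", "super", "nice", "song",
               "thumba", "chennagide", "illa", "namma", "guru"]
train_labels = ["English"] * 5 + ["Kannada"] * 5

model = NgramNaiveBayes().fit(train_words, train_labels)
print([(w, model.predict(w)) for w in "nice video guru".split()])
```

A real system would train on the labeled shared-task data, cover all six tags, and could combine such character features with pre-trained word embeddings, as the abstract describes.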

Anthology ID: 2022.icon-wlli.1
Volume: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
Month: December
Year: 2022
Address: IIIT Delhi, New Delhi, India
Editors: Bharathi Raja Chakravarthi, Abirami Murugappan, Dhivya Chinnappa, Adeep Hane, Prasanna Kumar Kumeresan, Rahul Ponnusamy
Venue: ICON
Publisher: Association for Computational Linguistics
Pages: 1–6
URL: https://aclanthology.org/2022.icon-wlli.1
Cite (ACL): O. E. Ojo, A. Gelbukh, H. Calvo, A. Feldman, O. O. Adebanji, and J. Armenta-Segura. 2022. Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding. In Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, pages 1–6, IIIT Delhi, New Delhi, India. Association for Computational Linguistics.
Cite (Informal): Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding (E. Ojo et al., ICON 2022)
PDF: https://preview.aclanthology.org/teach-a-man-to-fish/2022.icon-wlli.1.pdf
