Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding

O. E. Ojo, A. Gelbukh, H. Calvo, A. Feldman, O. O. Adebanji, J. Armenta-Segura

Abstract
People often switch languages in conversation or writing when expressing their thoughts on social media platforms. Texts of this type, known as code-mixed texts, can mix languages at the sentence, word, or even sub-word level. In this paper, we address the problem of identifying the language of each word in code-mixed texts using character sequences and word embeddings. We feed machine learning models and deep neural networks a range of character-based and word-based text features as input. The data for this experiment was created by combining YouTube video comments written in code-mixed Kannada and English (Kn-En). The texts were pre-processed, split into words, and each word was labeled as ‘Kannada’, ‘English’, ‘Mixed-Language’, ‘Name’, ‘Location’, or ‘Other’. The proposed techniques learned from these features and effectively identified the language of the words in the dataset. The proposed CK-Keras model with pre-trained Word2Vec embeddings was our best-performing system, outperforming the other methods in terms of F1 score.
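To make the idea of character-sequence features concrete, the following is a minimal sketch of word-level language identification from character n-grams. It is not the CK-Keras model described above: it swaps in a small naive Bayes classifier written with the standard library only, and the function `char_ngrams`, the class `NgramNaiveBayes`, and the toy training words are all hypothetical and invented for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return the character n-grams of a word, with '<' and '>' as boundary markers."""
    padded = f"<{word.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (illustrative stand-in model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # add-alpha smoothing constant
        self.label_counts = Counter()           # number of training words per label
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()                      # all n-grams seen in training

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            self.label_counts[label] += 1
            for gram in char_ngrams(word):
                self.ngram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, word):
        total_words = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log prior plus smoothed log likelihood of each n-gram
            score = math.log(count / total_words)
            denom = sum(self.ngram_counts[label].values()) + self.alpha * vocab_size
            for gram in char_ngrams(word):
                score += math.log((self.ngram_counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, invented training data: romanized words with assumed labels.
train_words = ["maadi", "beku", "illa", "hegide", "video", "song", "nice", "this"]
train_labels = ["Kannada"] * 4 + ["English"] * 4

model = NgramNaiveBayes().fit(train_words, train_labels)
print(model.predict("song"), model.predict("beku"))
```

A real system would train on the full labeled dataset, cover all six labels, and could combine such character features with word embeddings, as the abstract describes.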
Anthology ID:
2022.icon-wlli.1
Volume:
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
Month:
December
Year:
2022
Address:
IIIT Delhi, New Delhi, India
Editors:
Bharathi Raja Chakravarthi, Abirami Murugappan, Dhivya Chinnappa, Adeep Hane, Prasanna Kumar Kumeresan, Rahul Ponnusamy
Venue:
ICON
Publisher:
Association for Computational Linguistics
Pages:
1–6
URL:
https://aclanthology.org/2022.icon-wlli.1
Cite (ACL):
O. E. Ojo, A. Gelbukh, H. Calvo, A. Feldman, O. O. Adebanji, and J. Armenta-Segura. 2022. Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding. In Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts, pages 1–6, IIIT Delhi, New Delhi, India. Association for Computational Linguistics.
Cite (Informal):
Language Identification at the Word Level in Code-Mixed Texts Using Character Sequence and Word Embedding (E. Ojo et al., ICON 2022)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/2022.icon-wlli.1.pdf