Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data

Vamshi Krishna Srirangam, Appidi Abhinav Reddy, Vinay Singh, Manish Shrivastava


Abstract
Named Entity Recognition (NER) is one of the important tasks in Natural Language Processing (NLP) and is also a subtask of Information Extraction. In this paper we present our work on NER in Telugu-English code-mixed social media data. Code-mixing, a progeny of multilingualism, is a way in which multilingual people express themselves on social media by using linguistic units from different languages within a sentence or speech context. Entity extraction from social media data such as tweets (Twitter) is in general difficult due to its informal nature; code-mixed data further complicates the problem due to its informal, unstructured and incomplete information. We present a Telugu-English code-mixed corpus with the corresponding named entity tags. The named entities used to tag the data are Person (‘Per’), Organization (‘Org’) and Location (‘Loc’). We experimented with the machine learning models Conditional Random Fields (CRFs), Decision Trees and BiLSTMs on our corpus, which resulted in F1-scores of 0.96, 0.94 and 0.95 respectively.
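As an illustration of the F1-score named in the abstract, the following is a minimal sketch of a per-tag, token-level F1 computation. It is not the authors' evaluation code. The tag sequences are invented for the example, the 'Other' tag for non-entity tokens is an assumption, and the abstract does not say how the reported scores were averaged across tags.

```python
# Hypothetical gold and predicted tags for six tokens (not taken from the corpus).
# 'Per', 'Org' and 'Loc' come from the abstract; 'Other' is an assumed non-entity tag.
gold = ["Per", "Other", "Loc", "Other", "Org", "Other"]
pred = ["Per", "Other", "Other", "Other", "Org", "Loc"]


def per_tag_f1(gold_tags, pred_tags, tag):
    """Return the token-level F1 for one tag, or 0.0 when it is undefined."""
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(1 for g, p in pairs if g == tag and p == tag)  # true positives
    fp = sum(1 for g, p in pairs if g != tag and p == tag)  # false positives
    fn = sum(1 for g, p in pairs if g == tag and p != tag)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for tag in ["Per", "Org", "Loc", "Other"]:
    print(tag, round(per_tag_f1(gold, pred, tag), 2))
```

A single figure per model, like those reported in the abstract, would require combining the per-tag values (for example by a weighted average over tag frequencies); which scheme applies here is not stated on this page.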
Anthology ID:
P19-2025
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Fernando Alva-Manchego, Eunsol Choi, Daniel Khashabi
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
183–189
URL:
https://aclanthology.org/P19-2025
DOI:
10.18653/v1/P19-2025
Cite (ACL):
Vamshi Krishna Srirangam, Appidi Abhinav Reddy, Vinay Singh, and Manish Shrivastava. 2019. Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 183–189, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data (Srirangam et al., ACL 2019)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/P19-2025.pdf