TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers
Suma Reddy Duggenpudi, Subba Reddy Oota, Mounika Marreddy, Radhika Mamidi
Abstract
Named Entity Recognition (NER) is a well-researched problem in English, with strong results made possible by the availability of resources. Transformer models, specifically masked language models (MLMs), have recently shown remarkable performance in NER. With growing data on different online platforms, there is a need for NER in other languages too. NER remains underexplored in Indian languages due to the lack of resources and tools. Our contributions in this paper include: (i) two annotated NER datasets for the Telugu language in multiple domains, the Newswire Dataset (ND) and the Medical Dataset (MD), which we combine to form the Combined Dataset (CD); (ii) a comparison of the fine-tuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with other baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); and (iii) a further investigation of the performance of Telugu pretrained transformer models against the multilingual models mBERT, XLM-R, and IndicBERT. We find that pretrained Telugu language models (BERT-Te and RoBERTa-Te) outperform the existing pretrained multilingual and baseline models in NER. On a large dataset (CD) of 38,363 sentences, BERT-Te achieves a high F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models have shown state-of-the-art performance on various existing Telugu NER datasets. We open-source our dataset, pretrained models, and code.
- Anthology ID:
- 2022.acl-srw.20
- Volume:
- Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 262–272
- URL:
- https://aclanthology.org/2022.acl-srw.20
- DOI:
- 10.18653/v1/2022.acl-srw.20
- Cite (ACL):
- Suma Reddy Duggenpudi, Subba Reddy Oota, Mounika Marreddy, and Radhika Mamidi. 2022. TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 262–272, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers (Duggenpudi et al., ACL 2022)
- PDF:
- https://preview.aclanthology.org/nodalida-main-page/2022.acl-srw.20.pdf
- Data
- WikiAnn