Part of Speech Tagging for a Resource Poor Language : Sindhi in Devanagari Script using HMM and CRF

Bharti Nathani, Nisheeth Joshi


Abstract
Part-of-speech (POS) tagging is a pre-processing step in many NLP applications and is used mainly in machine translation. This paper presents two POS taggers for Sindhi in Devanagari script, a resource-poor language: one based on a Hidden Markov Model (HMM) and one based on a Conditional Random Field (CRF). To develop these taggers, a corpus of 30,000 manually annotated sentences was prepared with the help of language experts. In evaluation, the HMM tagger achieved an accuracy of 76.60714% and the CRF tagger 88.79%.
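As a rough illustration of what an HMM-based tagger involves, the sketch below trains tag-transition and word-emission counts from a tagged corpus and decodes with the Viterbi algorithm. It is not the authors' implementation: the function names (`train_hmm`, `log_prob`, `viterbi`), the add-one smoothing, the `<s>` start symbol, and the tiny English toy corpus are all assumptions made for this example, and the abstract does not state which smoothing or toolkit the paper used.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start symbol, assumed for this sketch


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (the smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words` under the counted HMM."""
    if not words:
        return []
    tags = list(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unseen words

    # best[i][tag] = (log score of best path ending in tag at position i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][prev][0] + log_prob(transitions[prev], tag, n_tags) + emit, prev)
                for prev in tags
            )

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy English corpus, used only to keep the example self-contained.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

transitions, emissions, vocab = train_hmm(TOY_CORPUS)
predicted = viterbi(["the", "cat", "barks"], transitions, emissions, vocab)
print(predicted)
```

A CRF tagger differs mainly in that it scores whole tag sequences from arbitrary, overlapping features of the input (for example suffixes or neighbouring words) rather than from per-tag emission counts. This is one commonly cited reason CRFs tend to outperform HMMs on tagging, and it is consistent with the accuracy gap reported above.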
Anthology ID:
2021.icon-main.75
Volume:
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2021
Address:
National Institute of Technology Silchar, Silchar, India
Editors:
Sivaji Bandyopadhyay, Sobha Lalitha Devi, Pushpak Bhattacharyya
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
611–618
URL:
https://aclanthology.org/2021.icon-main.75
Cite (ACL):
Bharti Nathani and Nisheeth Joshi. 2021. Part of Speech Tagging for a Resource Poor Language : Sindhi in Devanagari Script using HMM and CRF. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), pages 611–618, National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).
Cite (Informal):
Part of Speech Tagging for a Resource Poor Language : Sindhi in Devanagari Script using HMM and CRF (Nathani & Joshi, ICON 2021)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2021.icon-main.75.pdf