Shikhar Kumar Sarma



2023

PoS to UPoS Conversion and Creation of UPoS Tagged Resources for Assamese Language
Kuwali Talukdar | Shikhar Kumar Sarma
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

This paper addresses the task of transitioning from traditional Part-of-Speech (PoS) tagging to Universal Part-of-Speech (UPoS) tagging for the Assamese language. It outlines a methodology for PoS to UPoS conversion and for the creation of UPoS tagged resources, bridging the gap between localized linguistic analysis and universal standards. The significance of this work lies in its potential to enhance natural language processing and understanding for Assamese, contributing to broader multilingual applications. The paper details the data preparation and creation processes, annotation methods, and evaluation techniques, and discusses the challenges and opportunities encountered in the pursuit of linguistic universality. The research has implications for improving language technology for Assamese and can serve as a model for similar work in other regional languages. The standard PoS tagset applicable to Indian languages is mapped to the primary categories of the UPoS tagset with respect to the lexical behaviour of Assamese. The paper then presents, in a structured flow, the conversion of a PoS tagged text corpus to a UPoS tagged corpus using this mapping, followed by the use of a Deep Learning based model trained on the resulting dataset to create a sizable UPoS tagged corpus. This paper is a step towards a more standardized, universal understanding of linguistic elements in a diverse and multilingual world.
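The mapping-based conversion step described in this abstract can be sketched as a simple table look-up. This is only an illustration: the BIS-style source tags, the specific tag pairs, and the names `POS_TO_UPOS` and `convert_sentence` are assumptions made for this sketch and are not the mapping table or code from the paper.

```python
# Illustrative PoS -> UPoS mapping. The source tags are BIS-style labels
# used for Indian languages; the specific pairs below are assumptions for
# this sketch, not the mapping published in the paper.
POS_TO_UPOS = {
    "N_NN": "NOUN",
    "N_NNP": "PROPN",
    "PR_PRP": "PRON",
    "V_VM": "VERB",
    "V_VAUX": "AUX",
    "JJ": "ADJ",
    "RB": "ADV",
    "PSP": "ADP",
    "CC_CCD": "CCONJ",
    "CC_CCS": "SCONJ",
    "QT_QTC": "NUM",
    "RD_PUNC": "PUNCT",
}


def convert_sentence(tagged, mapping=POS_TO_UPOS, fallback="X"):
    """Convert [(token, pos_tag), ...] to [(token, upos_tag), ...].

    Tags missing from the mapping fall back to the UPoS catch-all tag "X".
    """
    return [(token, mapping.get(tag, fallback)) for token, tag in tagged]


# Placeholder tokens w1..w4 stand in for Assamese words.
example = [("w1", "N_NNP"), ("w2", "N_NN"), ("w3", "V_VM"), ("w4", "RD_PUNC")]
converted = convert_sentence(example)
print(converted)
```

A corpus converted this way could then serve as training data for a tagger, which is the second stage the abstract describes; that stage is not sketched here.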

Parts of Speech (PoS) and Universal Parts of Speech (UPoS) Tagging: A Critical Review with Special Reference to Low Resource Languages
Kuwali Talukdar | Shikhar Kumar Sarma | Manash Pratim Bhuyan
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Universal Parts of Speech (UPoS) tags are the parts of speech annotations used in Universal Dependencies. Universal Dependencies (UD) helps in developing cross-linguistically consistent treebank annotations for multiple languages with a common framework and standard. UD treebanks are becoming increasingly important resources for various Natural Language Processing (NLP) tasks and research, such as semantic parsing, syntactic parsing, and linguistic parsing. There has been considerable interest in adopting UD and UPoS standards and resources for integration with various NLP techniques, including Machine Translation, Question Answering, and Sentiment Analysis. Consequently, a wide variety of Artificial Intelligence (AI) and NLP tools are being created with UD and UPoS standards on board. Part of Speech (PoS) tagging is one of the fundamental NLP tasks; it labels the words of a sentence or paragraph with lexical and grammatical annotations, based on the context of the sentence. Contemporary Machine Learning (ML) and Deep Learning (DL) techniques require good quality tagged resources for training tagger models, and low resource languages face serious challenges here. This paper discusses UPoS in UD and presents a concise yet inclusive review of the literature on UPoS, PoS, and various taggers for multiple languages, with special reference to low resource languages. Approaches already adopted and models developed for different low resource languages are included in this review, with representation from a wide variety of languages. The study also offers a comprehensive classification based on the well-known ML and DL techniques used in the development of part-of-speech taggers. It is intended to serve as a ready reference for understanding the nuances of PoS and UPoS tagging.

Neural Machine Translation for Assamese-Bodo, a Low Resourced Indian Language Pair
Kuwali Talukdar | Shikhar Kumar Sarma | Farha Naznin | Kishore Kashyap | Mazida Akhtara Ahmed | Parvez Aziz Boruah
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Impressive results have been reported for low resource languages using Neural Machine Translation (NMT), even where the parallel dataset is relatively small. This work presents Machine Translation experiments on the low resource Indian language pair Assamese-Bodo, with a relatively small amount of parallel data. Raw data is tokenized with the IndicNLP tool. The NMT model is trained on the preprocessed dataset, and model performance is observed under varying hyperparameters. Experiments were completed with vocabulary sizes of 8000 and 16000. A significant increase in BLEU score was observed on doubling the vocabulary size, and increasing the data size also improved overall performance. BLEU scores were recorded after training on a dataset of 70000 parallel sentences, and the results are compared with another round of training on a dataset enhanced with 11500 items of Wordnet parallel data. A gold standard test set of 500 sentences was used for recording BLEU. The first round reported an overall BLEU of 4.0 with a vocabulary size of 8000. With the same vocabulary size and the Wordnet-enhanced dataset, a BLEU score of 4.33 was recorded. A significant increase in BLEU score (6.94) was observed with a vocabulary size of 16000. The next round of experiments added 7000 new data items and filtered the entire dataset; the new BLEU recorded was 9.68, with a vocabulary size of 16000. Cross validation was also designed and performed, with 8-fold data chunks prepared on the 80K total dataset. BLEU scores of 18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, and 7.70 were recorded for fold 1 through fold 8. The 8th-fold BLEU deviated from the trend, possibly because of nonhomogeneous data in the last fold.
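The 8-fold setup mentioned in this abstract can be illustrated with a minimal sketch. It assumes one common reading, in which each of the eight chunks is held out in turn as test data while the rest is used for training; the abstract does not spell out the exact protocol. The function names `make_folds` and `train_test_splits` and the synthetic sentence pairs are hypothetical.

```python
def make_folds(pairs, k=8):
    """Split a list of (source, target) pairs into k contiguous chunks.

    When len(pairs) is not divisible by k, the first chunks get one extra item.
    """
    fold_size, remainder = divmod(len(pairs), k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(pairs[start:end])
        start = end
    return folds


def train_test_splits(pairs, k=8):
    """Yield (fold_id, train, test), holding out each chunk once."""
    folds = make_folds(pairs, k)
    for i in range(k):
        test = folds[i]
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield i + 1, train, test


# Synthetic stand-in for an 80K-sentence parallel corpus.
corpus = [(f"src-{n}", f"tgt-{n}") for n in range(80_000)]
for fold_id, train, test in train_test_splits(corpus):
    print(fold_id, len(train), len(test))
```

Contiguous chunks keep whatever ordering the corpus has, so if the last portion of the data comes from a different source, its fold will differ from the others. Shuffling before splitting is the usual way to reduce that effect.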

2022

BERT-based Language Identification in Code-Mix Kannada-English Text at the CoLI-Kanglish Shared Task@ICON 2022
Pritam Deka | Nayan Jyoti Kalita | Shikhar Kumar Sarma
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts

Language identification in code-mixed text has recently gained research interest due to the extensive use of social media. People who speak multiple languages tend to use code-mixed language when communicating with each other. It has become necessary to identify the languages in such code-mixed environments in order to detect hate speech, fake news, misinformation, or disinformation, and for tasks such as sentiment analysis. In this work, we propose a BERT-based approach for language identification in the CoLI-Kanglish shared task at ICON 2022. Our approach achieved a weighted average F1 score of 86% and a macro average F1 score of 57% on the test set.
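The gap between the weighted and macro averages reported above is typical when classes are imbalanced: the weighted average is dominated by frequent labels, while the macro average treats every label equally. The sketch below illustrates the two averages on made-up counts; the label names and numbers in `toy_counts` are hypothetical and are not the shared task's label set or results.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(counts):
    """counts maps label -> (tp, fp, fn); support per label is tp + fn."""
    scores = {label: f1(*c) for label, c in counts.items()}
    support = {label: tp + fn for label, (tp, fp, fn) in counts.items()}
    total = sum(support.values())
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in scores) / total
    return macro, weighted


# Hypothetical per-label counts: two frequent labels and one rare label.
toy_counts = {
    "kn": (900, 50, 100),
    "en": (450, 60, 50),
    "mixed": (5, 10, 45),
}
macro, weighted = macro_and_weighted_f1(toy_counts)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy setting the rare label is classified poorly, which pulls the macro average well below the weighted one.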

2019

Development of Assamese Rule based Stemmer using WordNet
Jumi Sarmah | Shikhar Kumar Sarma | Anup Kumar Barman
Proceedings of the 10th Global Wordnet Conference

Stemming is a technique that reduces an inflected word to its root form. Assamese is a morphologically rich, scheduled Indian language in which various suffixes are applied to a word in various contexts. Normalizing such inflected words helps improve the performance of various Natural Language Processing applications. This paper develops a look-up and rule-based suffix stripping approach for the Assamese language using WordNet. The authors prepare the dictionary with root words extracted from Assamese WordNet and Named Entities. Appropriate stemming rules for inflected nouns and verbs are set in the rule engine, and the stemmed output is then tested against the morphological root words of Assamese WordNet and Named Entities by computing Hamming distance. The stemmer developed for the Assamese language achieves an accuracy of 85%. The authors also report the IR system’s performance when the Assamese stemmer is applied, and show its efficiency by retrieving sense-oriented results for the fired query. This morphological analyzer can thus serve as a starting point for developing various Assamese NLP applications.
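A look-up plus suffix-stripping stemmer of the kind described in this abstract can be sketched as follows. Everything concrete here is an assumption for illustration: the romanized toy roots and suffixes, the limit of two stacked suffixes, and the decision to count a length mismatch as an error when comparing against gold roots. None of it is the paper's dictionary, rule set, or evaluation code.

```python
# Toy romanized data (hypothetical; not the paper's dictionary or rules).
ROOTS = {"manuh", "kitap", "ghar"}
SUFFIXES = ["bilak", "bor", "khon", "ot", "or", "e", "k"]  # longer ones first


def stem(word, roots=ROOTS, suffixes=SUFFIXES, max_strips=2):
    """Return a dictionary root for `word`, or `word` unchanged if none fits."""
    if word in roots:  # look-up step
        return word
    if max_strips == 0:
        return word
    for suffix in suffixes:  # rule step: strip a suffix, then retry
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = stem(word[: -len(suffix)], roots, suffixes, max_strips - 1)
            if candidate in roots:
                return candidate
    return word


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def accuracy(gold):
    """Share of words whose stem is at Hamming distance 0 from the gold root.

    A length mismatch is counted as an error (an assumption of this sketch).
    """
    correct = 0
    for word, root in gold.items():
        predicted = stem(word)
        if len(predicted) == len(root) and hamming(predicted, root) == 0:
            correct += 1
    return correct / len(gold)


gold = {"manuhbor": "manuh", "kitapkhon": "kitap", "gharot": "ghar", "nodit": "nodi"}
print({word: stem(word) for word in gold}, accuracy(gold))
```

Requiring the stripped form to appear in the root dictionary keeps the rules from over-stemming, at the cost of leaving out-of-dictionary words such as the last toy entry unchanged.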