Pruthwik Mishra


Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Sankalp Bahad | Pruthwik Mishra | Parameswari Krishnamurthy | Dipti Sharma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present a human annotated named entity corpora of ∼40K sentences for 4 Indian languages from two of the major Indian language families. Additionally, we show the transfer learning capabilities of pre-trained transformer models from a high resource language to multiple low resource languages through a series of experiments. We also present a multilingual model fine-tuned on our dataset, which achieves an F1 score of ∼0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.

LTRC-IIITH at EHRSQL 2024: Enhancing Reliability of Text-to-SQL Systems through Abstention and Confidence Thresholding
Jerrin Thomas | Pruthwik Mishra | Dipti Sharma | Parameswari Krishnamurthy
Proceedings of the 6th Clinical Natural Language Processing Workshop

In this paper, we present our work in the EHRSQL 2024 shared task which tackles reliable text-to-SQL modeling on Electronic Health Records. Our proposed system tackles the task with three modules - abstention module, text-to-SQL generation module, and reliability module. The abstention module identifies whether the question is answerable given the database schema. If the question is answerable, the text-to-SQL generation module generates the SQL query and associated confidence score. The reliability module has two key components - confidence score thresholding, which rejects generations with confidence below a pre-defined level, and error filtering, which identifies and excludes SQL queries that result in execution errors. In the official leaderboard for the task, our system ranks 6th. We have also made the source code public.
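The reliability module described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filter_prediction`, the 0.5 threshold, the `patients` table, and the use of SQLite are all assumptions made for illustration.

```python
import sqlite3
from typing import Optional


def filter_prediction(conn: sqlite3.Connection, sql: str,
                      confidence: float, threshold: float = 0.5) -> Optional[str]:
    """Return the query if it passes both reliability checks, else None (abstain).

    The threshold value is a hypothetical default, not one taken from the paper.
    """
    # Confidence thresholding: reject generations below the pre-defined level.
    if confidence < threshold:
        return None
    # Error filtering: reject queries that raise an execution error.
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sql


# Hypothetical toy database standing in for an EHR schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
conn.execute("INSERT INTO patients VALUES (1, 64)")

print(filter_prediction(conn, "SELECT age FROM patients", 0.9))     # kept
print(filter_prediction(conn, "SELECT age FROM patients", 0.2))     # low confidence
print(filter_prediction(conn, "SELECT height FROM patients", 0.9))  # execution error
```

In this sketch a rejected query is returned as `None`, which a caller could treat the same way as an abstention on an unanswerable question.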

Towards Disfluency Annotated Corpora for Indian Languages
Chayan Kochar | Vandan Vasantlal Mujadia | Pruthwik Mishra | Dipti Misra Sharma
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

In the natural course of spoken language, individuals often engage in thinking and self-correction during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for subsequent downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present a comprehensive study of disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of the linguistic nuances inherent to Indian languages that must be considered during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.


HAWP: a Dataset for Hindi Arithmetic Word Problem Solving
Harshita Sharma | Pruthwik Mishra | Dipti Sharma
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Word Problem Solving remains a challenging and interesting task in NLP. A lot of research has been carried out to solve different genres of word problems with various complexity levels in recent years. However, most of the publicly available datasets and work are for English. Recently there has been a surge in this area of word problem solving in Chinese with the creation of large benchmark datasets. Apart from these two languages, labeled benchmark datasets for low resource languages are very scarce. This is the first attempt to address this issue for any Indian Language, especially Hindi. In this paper, we present HAWP (Hindi Arithmetic Word Problems), a dataset consisting of 2336 arithmetic word problems in Hindi. We also developed baseline systems for solving these word problems. We also propose a new evaluation technique for word problem solvers taking equation equivalence into account.
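The abstract does not say how equation equivalence is measured. As one possible illustration only, the sketch below treats two equations as equivalent when they determine the same value for a single unknown `X`. It assumes equations that are linear in `X` and use only `+`, `-`, `*`, `/`; the function names are hypothetical.

```python
import ast
import operator
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node, x):
    """Evaluate a restricted arithmetic expression tree exactly, with X bound to x."""
    if isinstance(node, ast.Expression):
        return _eval(node.body, x)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return Fraction(str(node.value))
    if isinstance(node, ast.Name) and node.id == "X":
        return x
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left, x), _eval(node.right, x))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand, x)
    raise ValueError("unsupported expression")


def solve_linear(equation: str) -> Fraction:
    """Solve 'lhs = rhs' for X, assuming the equation is linear in X."""
    lhs, rhs = equation.split("=")
    lhs_tree = ast.parse(lhs.strip(), mode="eval")
    rhs_tree = ast.parse(rhs.strip(), mode="eval")

    def residual(x):
        return _eval(lhs_tree, x) - _eval(rhs_tree, x)

    # For a linear residual r(x) = a*x + b, two evaluations recover a and b.
    r0, r1 = residual(Fraction(0)), residual(Fraction(1))
    slope = r1 - r0
    if slope == 0:
        raise ValueError("equation does not determine X")
    return -r0 / slope


def equivalent(eq_a: str, eq_b: str) -> bool:
    """True when both equations yield the same value of X."""
    return solve_linear(eq_a) == solve_linear(eq_b)


print(equivalent("X = 5 + 3", "X = 3 + 5"))
print(equivalent("X + 3 = 8", "X = 8 - 3"))
print(equivalent("X = 8 - 3", "X = 3 - 8"))
```

A check of this kind credits a solver for predicting `X = 3 + 5` when the reference is `X = 5 + 3`, which an exact string match would count as an error. It also accepts any equation with the same solution, so a stricter structural comparison may be closer to what the paper intends.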


Proceedings of the First Workshop on Parsing and its Applications for Indian Languages
Kengatharaiyer Sarveswaran | Parameswari Krishnamurthy | Pruthwik Mishra
Proceedings of the First Workshop on Parsing and its Applications for Indian Languages


Annotated Corpus for Sentiment Analysis in Odia Language
Gaurav Mohanty | Pruthwik Mishra | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Given the lack of an annotated corpus of non-traditional Odia literature which serves as the standard when it comes to sentiment analysis, we have created an annotated corpus of Odia sentences and made it publicly available to promote research in the field. Secondly, in order to test the usability of the currently available Odia sentiment lexicon, we experimented with various classifiers by training and testing on the sentiment annotated corpus while using identified affective words from the same as features. Annotation and classification are done at the sentence level, as the sentiment lexicon is best suited to sentiment analysis at this level. The created corpus contains 2045 Odia sentences from the news domain annotated with sentiment labels using a well-defined annotation scheme. An inter-annotator agreement score of 0.79 is reported for the corpus.

Linguistically Informed Hindi-English Neural Machine Translation
Vikrant Goyal | Pruthwik Mishra | Dipti Misra Sharma
Proceedings of the Twelfth Language Resources and Evaluation Conference

Hindi-English Machine Translation is a challenging problem, owing to multiple factors including the morphological complexity and relatively free word order of Hindi, in addition to the lack of sufficient parallel training data. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. To overcome the data sparsity issue caused by the lack of large parallel corpora for Hindi-English, we propose a method to employ additional linguistic knowledge which is encoded by different phenomena exhibited by Hindi. We generalize the embedding layer of the state-of-the-art Transformer model to incorporate linguistic features like POS tag, lemma and morph features to improve the translation performance. We compare the results obtained on incorporating this knowledge with the baseline systems and demonstrate significant performance improvements. Although the Transformer NMT models have a strong efficacy to learn language constructs, we show that the usage of specific features further helps in improving the translation performance.

Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task


Dataset for Aspect Detection on Mobile reviews in Hindi
Pruthwik Mishra | Ayush Joshi | Dipti Sharma
Proceedings of the 16th International Conference on Natural Language Processing

In recent years, Opinion Mining has become one of the most interesting fields of Language Processing. Opinion mining extracts the gist of a sentence in a shorter and more efficient manner. In this paper, we focus on detecting aspects for a particular domain. While relevant research work has been done on aspect detection in resource-rich languages like English, we attempt the same for Hindi, a relatively resource-poor language. Here we present a corpus of mobile reviews which are labelled with carefully curated aspects. The motivation behind aspect detection is to get information about the data at a finer level. In this paper, we identify all aspects related to the gadget which are present in the reviews given online on various websites. We also propose baseline models to detect aspects in Hindi text after conducting various experiments.

Arabic Dialect Identification for Travel and Twitter Text
Pruthwik Mishra | Vandan Mujadia
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper presents the results of the experiments done as a part of the MADAR Shared Task in WANLP 2019 on Arabic Fine-Grained Dialect Identification. Dialect Identification is one of the prominent tasks in the field of Natural Language Processing, where the subsequent language modules can be improved based on it. We explored the use of different features like char n-grams, word n-grams, language model probabilities, etc. on different classifiers. Results show that these features help to improve dialect classification accuracy. Results also show that traditional machine learning classifiers tend to perform better than neural network models on this task in a low resource setting.
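Character and word n-gram features of the kind mentioned above can be extracted with a few lines of code. This is a minimal sketch, not the paper's feature pipeline: the function names, whitespace tokenization, and the placeholder input strings are assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams, using whitespace tokenization."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Placeholder strings; real input would be Arabic dialect sentences.
print(char_ngrams("abab", 2))
print(word_ngrams("a b a b", 2))
```

The resulting counts can be fed to a classifier as sparse feature vectors, for example after combining several values of `n`.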


Automated Error Correction and Validation for POS Tagging of Hindi
Sachi Angle | Pruthwik Mishra | Dipti Mishra Sharma
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

EquGener: A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations
Pruthwik Mishra | Litton J Kurisinkel | Dipti Misra Sharma | Vasudeva Varma
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation


POS Tagging For Resource Poor Languages Through Feature Projection
Pruthwik Mishra | Vandan Mujadia | Dipti Misra Sharma
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Deep Neural Network based system for solving Arithmetic Word problems
Purvanshi Mehta | Pruthwik Mishra | Vinayak Athavale | Manish Shrivastava | Dipti Sharma
Proceedings of the IJCNLP 2017, System Demonstrations

This paper presents DILTON, a system which solves simple arithmetic word problems. DILTON uses a Deep Neural based model to solve math word problems. DILTON divides the question into two parts - worldstate and query. The worldstate and the query are processed separately in two different networks and finally, the networks are merged to predict the final operation. We report the first deep learning approach for the prediction of operation between two numbers. DILTON learns to predict operations with 88.81% accuracy in a corpus of primary school questions.

IIIT-H at IJCNLP-2017 Task 3: A Bidirectional-LSTM Approach for Review Opinion Diversification
Pruthwik Mishra | Prathyusha Danda | Silpa Kanneganti | Soujanya Lanka
Proceedings of the IJCNLP 2017, Shared Tasks

The Review Opinion Diversification (Revopid-2017) shared task focuses on selecting top-k reviews from a set of reviews for a particular product based on specific criteria. In this paper, we describe our approaches and results for modeling the ranking of reviews based on their usefulness score, this being the first of the three subtasks under this shared task. Instead of posing this as a regression problem, we modeled this as a classification task where we want to identify whether a review is useful or not. We employed a bi-directional LSTM to represent each review, which is used with a softmax layer to predict the usefulness score. We chose the review with the highest usefulness score, then found its cosine similarity score with the rest of the reviews. This is done in order to ensure diversity in the selection of top-k reviews. On the top-5 list prediction, we finished 3rd, while on the top-10 list prediction, we placed 2nd in the shared task. We have discussed the model and the results in detail in the paper.
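The selection step described above (take the most useful review, then use cosine similarity against the rest to encourage diversity) can be sketched as follows. This is a minimal illustration under stated assumptions: reviews are represented as bag-of-words counts rather than Bi-LSTM representations, the usefulness scores are given as input, and the remaining slots are filled with the reviews least similar to the top one. The abstract does not specify these details, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_top_k(reviews: List[str], scores: List[float], k: int) -> List[int]:
    """Return indices of k reviews: the most useful one, then the least similar to it."""
    vectors = [Counter(r.lower().split()) for r in reviews]
    best = max(range(len(reviews)), key=lambda i: scores[i])
    rest = [i for i in range(len(reviews)) if i != best]
    # Lower similarity to the top review comes first; ties go to the higher score.
    rest.sort(key=lambda i: (cosine(vectors[best], vectors[i]), -scores[i]))
    return [best] + rest[:k - 1]


# Hypothetical reviews and usefulness scores.
reviews = ["great battery life", "great battery life indeed", "poor screen quality"]
scores = [0.9, 0.8, 0.5]
print(select_top_k(reviews, scores, 2))
```

With these inputs the second slot goes to the screen review rather than the near-duplicate battery review, even though the latter has a higher usefulness score.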

IIIT-H at IJCNLP-2017 Task 4: Customer Feedback Analysis using Machine Learning and Neural Network Approaches
Prathyusha Danda | Pruthwik Mishra | Silpa Kanneganti | Soujanya Lanka
Proceedings of the IJCNLP 2017, Shared Tasks

The IJCNLP 2017 shared task on Customer Feedback Analysis focuses on classifying customer feedback into one of a predefined set of categories or classes. In this paper, we describe our approach to this problem and the results on four languages, i.e. English, French, Japanese and Spanish. Our system implemented a bidirectional LSTM (Graves and Schmidhuber, 2005) using pre-trained GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016) embeddings, and SVM (Cortes and Vapnik, 1995) with TF-IDF vectors for classifying the feedback data, which is described in the later sections. We also tried different machine learning techniques and compared the results in this paper. Out of the 12 participating teams, our systems obtained exact accuracy scores of 0.65, 0.86, 0.70 and 0.56 in English, Spanish, French and Japanese respectively. We observed that our systems perform better than the baseline systems in three languages, while we match the baseline accuracy for Japanese on our submitted systems. We noticed significant improvements in Japanese in later experiments, matching the highest performing system that was submitted in the shared task, which we will discuss in this paper.


Non-decreasing Sub-modular Function for Comprehensible Summarization
Litton J Kurisinkel | Pruthwik Mishra | Vigneshwaran Muralidaran | Vasudeva Varma | Dipti Misra Sharma
Proceedings of the NAACL Student Research Workshop