Anil Kumar Singh

Also published as: Anil Kumar Singh, Anil kumar Singh

2021

On Reducing Repetition in Abstractive Summarization
Pranav Nair | Anil Kumar Singh
Proceedings of the Student Research Workshop Associated with RANLP 2021

Repetition in natural language generation reduces the informativeness of text and makes it less appealing. Various techniques have been proposed to alleviate it. In this work, we explore and propose techniques to reduce repetition in abstractive summarization. First, we explore the application of unlikelihood training and embedding matrix regularizers from previous work on language modeling to abstractive summarization. Next, we extend the coverage and temporal attention mechanisms to the token level to reduce repetition. In our experiments on the CNN/Daily Mail dataset, we observe that these techniques reduce the amount of repetition and increase the informativeness of the summaries, which we confirm via human evaluation.

Improving Abstractive Summarization with Commonsense Knowledge
Pranav Nair | Anil Kumar Singh
Proceedings of the Student Research Workshop Associated with RANLP 2021

Large-scale pretrained models have demonstrated strong performance on several natural language generation and understanding benchmarks. However, introducing commonsense into them to generate more realistic text remains a challenge. Inspired by previous work on commonsense knowledge generation and generative commonsense reasoning, we introduce two methods to add commonsense reasoning skills and knowledge into abstractive summarization models. Both methods outperform the baseline on ROUGE scores. Human evaluation results suggest that summaries generated by our methods are more realistic and have fewer commonsensical errors.

2020

Generating Inflectional Errors for Grammatical Error Correction in Hindi
Ankur Sonawane | Sujeet Kumar Vishwakarma | Bhavana Srivastava | Anil Kumar Singh
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop

Automated grammatical error correction has been explored as an important research problem within NLP, with the majority of the work being done on English and similar resource-rich languages. Grammar correction using neural networks is a data-heavy task, with recent state-of-the-art models requiring datasets with millions of annotated sentences for proper training. It is difficult to find such resources for Indic languages due to their relative lack of digitized content and their complex morphology, compared to English. We address this problem by generating a large corpus of artificial inflectional errors for training GEC models. Moreover, to evaluate the performance of models trained on this dataset, we create a corpus of real Hindi errors extracted from Wikipedia edits. Analyzing this dataset with a modified version of the ERRANT error annotation toolkit, we find that inflectional errors are very common in this language. Finally, we produce the initial baseline results using state-of-the-art methods developed for English.

NLPRL at WNUT-2020 Task 2: ELMo-based System for Identification of COVID-19 Tweets
Rajesh Kumar Mundotiya | Rupjyoti Baruah | Bhavana Srivastava | Anil Kumar Singh
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

The Coronavirus pandemic has dominated the news on social media for many months. Efforts are being made to reduce its spread and to reduce casualties as well as new infections. For this purpose, information about infected people and their related symptoms, as available on social media such as Twitter, can help in prevention and in taking precautions. This is an example of using noisy text processing for disaster management. This paper discusses the NLPRL results in Shared Task-2 of the WNUT-2020 workshop. We have treated this problem as a binary classification problem and have used pre-trained ELMo embeddings with GRU units. This approach classifies the tweets with an accuracy of 80.85% and an F1-score of 78.54% on the provided test dataset. The experimental code is available online.

Transformer-based Neural Machine Translation System for Hindi – Marathi: WMT20 Shared Task
Amit Kumar | Rupjyoti Baruah | Rajesh Kumar Mundotiya | Anil Kumar Singh
Proceedings of the Fifth Conference on Machine Translation

This paper reports the results for the Machine Translation (MT) system submitted by the NLPRL team for the Hindi – Marathi Similar Translation Task at WMT 2020. We apply the Transformer-based Neural Machine Translation (NMT) approach on both translation directions for this language pair. The trained model is evaluated on the corpus provided by shared task organizers, using BLEU, RIBES, and TER scores. There were a total of 23 systems submitted for Marathi to Hindi and 21 systems submitted for Hindi to Marathi in the shared task. Out of these, our submission ranked 6th and 9th, respectively.

NLPRL System for Very Low Resource Supervised Machine Translation
Rupjyoti Baruah | Rajesh Kumar Mundotiya | Amit Kumar | Anil kumar Singh
Proceedings of the Fifth Conference on Machine Translation

This paper describes the results of the system that we used for the WMT20 very low resource (VLR) supervised MT shared task. For our experiments, we use a byte-level version of BPE, which requires a base vocabulary of only 256 symbols. BPE-based models are a kind of sub-word model. Such models try to address the out-of-vocabulary (OOV) word problem by performing word segmentation so that segments correspond to morphological units. They are also reported to work across different languages, especially similar languages, due to their sub-word nature. Based on the cased BLEU score, our NLPRL systems ranked ninth for HSB to GER and tenth for GER to HSB translation.
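
As an illustration of why a byte-level base vocabulary needs only 256 symbols, the following minimal sketch maps text to its UTF-8 byte values, which always fall in the range 0 to 255 regardless of script. It is not the authors' system: the function name `byte_symbols` and the sample strings are hypothetical, and the sketch covers only the base vocabulary, not the BPE merge-learning step.

```python
from typing import List


def byte_symbols(text: str) -> List[int]:
    """Return the UTF-8 byte values of text; each value is a base-vocabulary id in 0..255."""
    return list(text.encode("utf-8"))


# Hypothetical sample strings (an Upper Sorbian and a German greeting).
samples = ["Dobry dźeń", "Guten Tag"]

for s in samples:
    ids = byte_symbols(s)
    # Every id fits in the 256-symbol base vocabulary, so no input is out of vocabulary.
    assert all(0 <= i < 256 for i in ids)
    # Decoding the byte sequence recovers the original string exactly.
    assert bytes(ids).decode("utf-8") == s
    # Non-ASCII characters expand to several bytes, so the id sequence can be longer than the text.
    print(s, len(s), len(ids))
```

The trade-off is sequence length: characters outside ASCII are split into multiple byte symbols, which the learned BPE merges can then recombine into larger sub-word units.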

Unsupervised Approach for Zero-Shot Experiments: Bhojpuri–Hindi and Magahi–Hindi@LoResMT 2020
Amit Kumar | Rajesh Kumar Mundotiya | Anil Kumar Singh
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

This paper reports a Machine Translation (MT) system submitted by the NLPRL team for the Bhojpuri–Hindi and Magahi–Hindi language pairs at the LoResMT 2020 shared task. We used an unsupervised domain adaptation approach that gives promising results for zero or extremely low resource languages. The task organizers provide the development and test sets for evaluation and the monolingual data for training. Our approach is a hybrid of domain adaptation and back-translation. The metrics used to evaluate the trained model are BLEU, RIBES, Precision, Recall and F-measure. Our approach gives relatively promising results that vary widely across directions: 19.5, 13.71, 2.54, and 3.16 BLEU points for Bhojpuri to Hindi, Magahi to Hindi, Hindi to Bhojpuri and Hindi to Magahi, respectively.

Attention-based Domain Adaption Using Transfer Learning for Part-of-Speech Tagging: An Experiment on the Hindi language
Rajesh Kumar Mundotiya | Vikrant Kumar | Arpit Mehta | Anil Kumar Singh
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Parsing Indian English News Headlines
Samapika Roy | Sukhada Sukhada | Anil Kumar Singh
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Parsing news headlines is one of the difficult tasks of Natural Language Processing, mostly because news headlines (NHs) are not complete grammatical sentences. News editors use all sorts of tricks to grab readers’ attention, for instance, unusual capitalization, as in the headline ‘Ear SHOT ashok rajagopalan’; other headlines demand world knowledge, as in ‘Church reformation celebrated’, where ‘Church reformation’ refers to a historical event and not to a piece of news about an ordinary church. The lack of transparency in NHs can be linguistic, cultural, social, or contextual. The limited space available for a news headline has led to creative liberty. Though much work on NHs, such as news value extraction, summary generation, and emotion classification, has been going on, parsing them has remained a tough challenge. Linguists have also been interested in NHs for the creativity of their language, which bends traditional grammar rules. Researchers have conducted studies on news reportage, discourse analysis of NHs, and many more. While the creativity seen in NHs is fascinating for language researchers, it poses a computational challenge for Natural Language Processing researchers. This paper presents an outline of ongoing doctoral research on the parsing of Indian English NHs. The ultimate aim of this research is to provide a module that will generate correctly parsed NHs. The intention is to enhance the broad applicability of newspaper corpora for future Natural Language Processing applications.

2019

NLPRL at WAT2019: Transformer-based Tamil – English Indic Task Neural Machine Translation System
Amit Kumar | Anil Kumar Singh
Proceedings of the 6th Workshop on Asian Translation

This paper describes the Machine Translation system for the Tamil-English Indic Task organized at WAT 2019. We use a Transformer-based architecture for Neural Machine Translation.

2018

Experiments on Morphological Reinflection: CoNLL-2018 Shared Task
Rishabh Jain | Anil Kumar Singh
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

How emotional are you? Neural Architectures for Emotion Intensity Prediction in Microblogs
Devang Kulshreshtha | Pranav Goel | Anil Kumar Singh
Proceedings of the 27th International Conference on Computational Linguistics

Social media based micro-blogging sites like Twitter have become a common source of real-time information (impacting organizations and their strategies), and are used for expressing emotions and opinions. Automated analysis of such content therefore rises in importance. To this end, we explore the viability of using deep neural networks on the specific task of emotion intensity prediction in tweets. We propose a neural architecture combining convolutional and fully connected layers in a non-sequential manner, done for the first time in the context of natural language based tasks. Combined with lexicon-based features along with transfer learning, our model achieves state-of-the-art performance, outperforming the previous system by 0.044 (4.4 percentage points) in Pearson correlation on the WASSA’17 EmoInt shared task dataset. We investigate the performance of deep multi-task learning models trained for all emotions at once in a unified architecture and get encouraging results. Experiments evaluating the correlation between emotion pairs offer interesting insights into the relationship between them.

Di-LSTM Contrast : A Deep Neural Network for Metaphor Detection
Krishnkant Swarnkar | Anil Kumar Singh
Proceedings of the Workshop on Figurative Language Processing

The contrast between the contextual and general meaning of a word serves as an important clue for detecting its metaphoricity. In this paper, we present a deep neural architecture for metaphor detection which exploits this contrast. Additionally, we also use cost-sensitive learning by re-weighting examples, and baseline features like concreteness ratings, POS and WordNet-based features. The best performing system of ours achieves an overall F1 score of 0.570 on All POS category and 0.605 on the Verbs category at the Metaphor Shared Task 2018.

IIT (BHU) Submission for the ACL Shared Task on Named Entity Recognition on Code-switched Data
Shashwat Trivedi | Harsh Rangwani | Anil Kumar Singh
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

This paper describes the best performing system for the shared task on Named Entity Recognition (NER) on code-switched data for the language pair Spanish-English (ENG-SPA). We introduce a gated neural architecture for the NER task. Our final model achieves an F1 score of 63.76%, outperforming the baseline by 10%.

IIT (BHU) System for Indo-Aryan Language Identification (ILI) at VarDial 2018
Divyanshu Gupta | Gourav Dhakad | Jayprakash Gupta | Anil Kumar Singh
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Text language identification is the Natural Language Processing task of recognizing which of many different languages a piece of text is written in. This paper describes our submission to the ILI 2018 shared task, which involves the identification of 5 closely related Indo-Aryan languages. We developed a word-level LSTM (Long Short-Term Memory) model, a specific type of Recurrent Neural Network model, for this task. Given a sentence, our model converts each word of the sentence into a trainable word embedding, feeds the embeddings into our LSTM network, and finally predicts the language. We obtained a macro F1 score of 0.836, ranking 5th in the task.

Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture
Soumil Mandal | Anil Kumar Singh
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

An accurate language identification tool is an absolute necessity for building complex NLP systems to be used on code-mixed data. A lot of work has recently been done on this problem, but there is still room for improvement. Inspired by recent advancements in neural network architectures for computer vision tasks, we have implemented multichannel neural networks combining CNN and LSTM for word-level language identification of code-mixed data. Combining this with a Bi-LSTM-CRF context capture module, accuracies of 93.28% and 93.32% are achieved on our two testing sets.

NLPRL-IITBHU at SemEval-2018 Task 3: Combining Linguistic Features and Emoji pre-trained CNN for Irony Detection in Tweets
Harsh Rangwani | Devang Kulshreshtha | Anil Kumar Singh
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes our participation in SemEval 2018 Task 3 on Irony Detection in Tweets. We combine linguistic features with pre-trained activations of a neural network. The CNN is trained on the emoji prediction task. We combine the two feature sets and feed them into an XGBoost classifier. Subtask-A involves classification of tweets into ironic and non-ironic instances, whereas Subtask-B involves classification of a tweet into one of four classes: non-ironic, verbal irony, situational irony, or other verbal irony. We observe that combining features from these two different feature spaces improves our system's results. We leverage the SMOTE algorithm to handle the problem of class imbalance in Subtask-B. Our final model achieves F1-scores of 0.65 and 0.47 on Subtask-A and Subtask-B, respectively. Our system ranks 4th on both subtasks, outperforming the baseline by 6% on Subtask-A and 14% on Subtask-B.

2017

Experiments on Morphological Reinflection: CoNLL-2017 Shared Task
Akhilesh Sudhakar | Anil Kumar Singh
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

IJCNLP-2017 Task 3: Review Opinion Diversification (RevOpiD-2017)
Anil Kumar Singh | Avijit Thawani | Mayank Panchal | Anubhav Gupta | Julian McAuley
Proceedings of the IJCNLP 2017, Shared Tasks

Unlike Entity Disambiguation in web search results, Opinion Disambiguation is a relatively unexplored topic. The RevOpiD shared task at IJCNLP-2017 aimed to attract attention towards this research problem. In this paper, we summarize the first run of this task and introduce a new dataset that we have annotated for the purpose of evaluating Opinion Mining, Summarization and Disambiguation methods.

IIT (BHU): System Description for LSDSem’17 Shared Task
Pranav Goel | Anil Kumar Singh
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

This paper describes an ensemble system submitted as part of the LSDSem Shared Task 2017, the Story Cloze Test. The main conclusion from our results is that an approach based on semantic similarity alone may not be enough for this task. We test various approaches and compare them with two ensemble systems: one based on voting and the other on a logistic regression classifier. Our final system outperforms the previous state of the art for the Story Cloze Test. Another very interesting observation is the performance of the sentiment-based approach, which works almost as well on its own as our final ensemble system.

Word Transduction for Addressing the OOV Problem in Machine Translation for Similar Resource-Scarce Languages
Shashikant Sharma | Anil Kumar Singh
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

Reference Scope Identification for Citances Using Convolutional Neural Networks
Saurav Jha | Aanchal Chaurasia | Akhilesh Sudhakar | Anil Kumar Singh
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Neural Morphological Disambiguation Using Surface and Contextual Morphological Awareness
Akhilesh Sudhakar | Anil Kumar Singh
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

A treebank is an important resource for developing many NLP based tools. Errors in the treebank may lead to errors in the tools that use it. It is essential to ensure the quality of a treebank before it can be deployed for other purposes. Automatic (or semi-automatic) detection of errors in the treebank can reduce the manual work required to find and remove errors. Usually, the errors found automatically are manually corrected by the annotators. Not much work has been reported so far on error correction tools that help the annotators correct errors efficiently. In this paper, we present such an error correction tool, which is an extension of the error detection method described earlier (Ambati et al., 2010; Ambati et al., 2011; Agarwal et al., 2012).

2012

A Concise Query Language with Search and Transform Operations for Corpora with Multiple Levels of Annotation
Anil Kumar Singh
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The usefulness of annotated corpora is greatly increased if there is an associated tool that allows various kinds of operations to be performed in a simple way. Different kinds of annotation frameworks and many query languages for them have been proposed, including some to deal with multiple layers of annotation. We present here an easy-to-learn query language for a particular kind of annotation framework based on 'threaded trees', which are somewhere between the complete order of a tree and the anarchy of a graph. Through 'typed' threads, they can allow multiple levels of annotation in the same document. Our language has a simple, intuitive and concise syntax and high expressive power. It not only allows searching for complicated patterns with short queries but also allows data manipulation and specification of arbitrary return values. Many commonly performed tasks that otherwise require writing programs can be performed with one or more queries. We compare the language with some others and try to evaluate it.

2010

Transliteration as Alignment vs. Transliteration as Generation for Crosslingual Information Retrieval
Anil Kumar Singh | Sethuramalingam Subramaniam | Taraka Rama
Traitement Automatique des Langues, Volume 51, Numéro 2 : Multilinguisme et traitement automatique des langues [Multilingualism and Natural Language Processing]

An Integrated Digital Tool for Accessing Language Resources
Anil Kumar Singh | Bharat Ram Ambati
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Language resources can be classified under several categories. To be able to query and operate on all (or most of) these categories using a single digital tool would be very helpful for a large number of researchers working on languages. We describe such a tool in this paper. It is different from other such tools in that it allows querying and transformation on different kinds of resources (such as corpora, lexicon and language models) with the same framework. Search options can be given based on the kind of resource being queried. It is possible to select a matched resource and open it for editing in the specialized interfaces with which that resource is associated. The tool also allows the extracted or modified data to be saved separately, apart from having the usual facilities like displaying the results in KeyWord-In-Context (KWIC) format. We also present the notation used for querying and transformation, which is comparable to but different from the Corpus Query Language (CQL).

Grammars play an important role in many Natural Language Processing (NLP) applications. The traditional approach of creating grammars manually, besides being labor-intensive, has several limitations. With the availability of large-scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank. In this paper, we present a basic approach to extract grammars from dependency treebanks of two Indian languages, Hindi and Telugu. The process of grammar extraction requires a generalization mechanism. Towards this end, we explore an approach which relies on generalization of argument structure over the verbs based on their syntactic similarity. Such a generalization counters the effect of data sparseness in the treebanks. A grammar extracted using this system can not only expand already existing knowledge bases for NLP tasks such as parsing, but also aid in the creation of grammars for languages where none exist. Further, we show that the grammar extraction process can help in identifying annotation errors and thus aid in the task of treebank validation.

2009

Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training
Taraka Rama | Anil Kumar Singh | Sudheer Kolachina
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium

From Bag of Languages to Family Trees From Noisy Corpus
Taraka Rama | Anil Kumar Singh
Proceedings of the International Conference RANLP-2009

2008

A More Discerning and Adaptable Multilingual Transliteration Mechanism for Indian Languages
Harshit Surana | Anil Kumar Singh
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

A Mechanism to Provide Language-Encoding Support and an NLP Friendly Editor
Anil Kumar Singh
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Natural Language Processing for Less Privileged Languages: Where do we come from? Where are we going?
Anil Kumar Singh
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

Named Entity Recognition for South and South East Asian Languages: Taking Stock
Anil Kumar Singh
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages

Estimating the Resource Adaption Cost from a Resource Rich Language to a Similar Resource Poor Language
Anil Kumar Singh | Kiran Pala | Harshit Surana
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Developing resources which can be used for Natural Language Processing is an extremely difficult task for any language, but is even more so for less privileged (or less computerized) languages. One way to overcome this difficulty is to adapt the resources of a linguistically close resource-rich language. In this paper we discuss how the cost of such adaption can be estimated using subjective and objective measures of linguistic similarity, for the purpose of allocating financial resources, time, manpower etc. Since this is the first work of its kind, the method described in this paper should be seen as only a preliminary method, indicative of how better methods can be developed. Corpora of several less computerized languages had to be collected for the work described in the paper, which was difficult because for many of these varieties there is not much electronic data available. Even where such data exists, it is in non-standard encodings, which means that we had to build encoding converters for these varieties. The varieties we have focused on are some of those spoken in the South Asian region.