Dipankar Das

2022

JU_NLP at HinglishEval: Quality Evaluation of the Low-Resource Code-Mixed Hinglish Text
Prantik Guha | Rudra Dhar | Dipankar Das
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

In this paper we describe a system submitted to the INLG 2022 Generation Challenge (GenChal) on Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text. We implement a Bi-LSTM-based neural network model to predict the Average rating score and Disagreement score of the synthetic Hinglish dataset. In our models, we used word embeddings for English and Hindi data, and one-hot encodings for Hinglish data. We achieved an F1 score of 0.11 and a mean squared error of 6.0 in the average rating score prediction task. In the Disagreement score prediction task, we achieved an F1 score of 0.18 and a mean squared error of 5.0.

Can Unsupervised Knowledge Transfer from Social Discussions Help Argument Mining?
Subhabrata Dutta | Jeevesh Juneja | Dipankar Das | Tanmoy Chakraborty
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Identifying argument components from unstructured texts and predicting the relationships expressed among them are two primary steps of argument mining. The intrinsic complexity of these tasks demands powerful learning models. While pretrained Transformer-based Language Models (LM) have been shown to provide state-of-the-art results over different NLP tasks, the scarcity of manually annotated data and the highly domain-dependent nature of argumentation restrict the capabilities of such models. In this work, we propose a novel transfer learning strategy to overcome these challenges. We utilize argumentation-rich social discussions from the ChangeMyView subreddit as a source of unsupervised, argumentative discourse-aware knowledge by finetuning pretrained LMs on a selectively masked language modeling task. Furthermore, we introduce a novel prompt-based strategy for inter-component relation prediction that complements our proposed finetuning method while leveraging the discourse context. Exhaustive experiments show the generalization capability of our method on these two tasks over within-domain as well as out-of-domain datasets, outperforming several existing and employed strong baselines.

2021

Leveraging Expectation Maximization for Identifying Claims in Low Resource Indian Languages
Rudra Dhar | Dipankar Das
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Identifying checkable claims is an important preliminary task when dealing with the vast amount of data streaming from the social web, and it becomes essential when analyzing such data for a multilingual country like India, which has more than 1 billion people. In the present work, we describe our system for detecting check-worthy claim sentences in resource-scarce Indian languages (e.g., Bengali and Hindi). First, we collected sentences from various sources in Bengali and Hindi and vectorized them with several NLP features. We manually labeled a small portion of them for check-worthy claims. To label the rest of the data in a semi-supervised fashion, we employed the Expectation Maximization (EM) algorithm tuned with the Multivariate Gaussian Mixture Model (GMM) to assign weak labels. The optimal number of Gaussians in this algorithm is determined using Logistic Regression. Furthermore, we used different ratios of manually labeled data and weakly labeled data to train our various machine learning models. We tabulated and plotted the performance of the models along with the stepwise decrement in the proportion of manually labeled data. The experimental results were consistent with our theoretical understanding, and we conclude that weak labeling of check-worthy claim sentences in low-resource languages with the EM algorithm has true potential.

Studies Towards Language Independent Fake News Detection
Soumayan Majumder | Dipankar Das
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Fake news is currently a trending topic, and it causes problems for many people and organizations. We work on the COVID19 domain in 7 languages, with data collected from Twitter. We build two types of model: one language dependent and the other language independent. The language-independent model gives better results for English, Hindi and Bengali. Results for European languages such as German, Italian, French and Spanish are comparable between the language-dependent and language-independent models.

Classification of COVID19 tweets using Machine Learning Approaches
Anupam Mondal | Sainik Mahata | Monalisa Dey | Dipankar Das
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

The reported work is a description of our participation in the “Classification of COVID19 tweets containing symptoms” shared task, organized by the “Social Media Mining for Health Applications (SMM4H)” workshop. The paper describes two machine learning approaches that were used to build a three-class classification system that categorizes tweets related to COVID19 into three classes, viz., self-reports, non-personal reports, and literature/news mentions. The steps for pre-processing tweets, feature extraction, and the development of the machine learning models are described extensively in the paper. The two developed models, when evaluated by the organizers, garnered F1 scores of 0.93 and 0.92 respectively.

Classifying Emotional Utterances by Employing Multi-modal Speech Emotion Recognition
Dipankar Das
Proceedings of the Workshop on Speech and Music Processing 2021

Deep learning methods have been applied to several speech processing problems in recent years. In the present work, we have explored different deep learning models for speech emotion recognition. We have employed a standard deep feedforward neural network (FFNN) and a convolutional neural network (CNN) to classify audio files according to their emotional content. A comparative study indicates that the CNN model outperforms the FFNN for emotion as well as gender classification. It was observed that purely audio-based models can capture emotions only up to a certain limit. Thus, we attempted a multi-modal framework that combines the benefits of audio and text features and feeds them into a recurrent encoder. Finally, the audio and text encoders are merged to provide the desired impact on various datasets. In addition, a database consisting of emotional utterances of several words has been developed as a part of this work. It contains the same word in different emotional utterances. Though the database is not that large, it is ideally supposed to contain all the English words that exist in an English dictionary.

Sentiment Classification of Code-Mixed Tweets using Bi-Directional RNN and Language Tags
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment analysis tools and models have been developed extensively throughout the years for European languages. In contrast, similar tools for Indian languages are scarce. This is because state-of-the-art pre-processing tools like POS taggers, shallow parsers, etc., are not readily available for Indian languages. Although such working tools are available for Indian languages like Hindi and Bengali, which are spoken by the majority of the population, finding the same for less spoken languages like Tamil, Telugu, and Malayalam is difficult. Moreover, due to the advent of social media, the multi-lingual population of India, who are comfortable with both English and their regional language, prefer to communicate by mixing both languages. This gives rise to massive code-mixed content, and automatically annotating it with the respective sentiment labels becomes a challenging task. In this work, we take up a similar challenge of developing a sentiment analysis model that can work with English-Tamil code-mixed data. The proposed work tries to solve this by using bi-directional LSTMs along with language tagging. Other traditional methods, based on classical machine learning algorithms, are also discussed in the paper, and they act as the baseline systems against which we compare our neural network based model. The developed neural network based model garnered precision, recall, and F1 scores of 0.59, 0.66, and 0.58 respectively.

2020

JUNLP@ICON2020: Low Resourced Machine Translation for Indic Languages
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

In the current work, we present the description of the systems submitted to a machine translation shared task organized by ICON 2020: 17th International Conference on Natural Language Processing. The systems were developed to show the capability of general domain machine translation when translating into Indic languages, English-Hindi, in our case. The paper shows the training process and quantifies the performance of two state-of-the-art translation systems, viz., Statistical Machine Translation and Neural Machine Translation. While Statistical Machine Translation systems work better in a low-resource setting, Neural Machine Translation systems are able to generate sentences that are fluent in nature. Since both these systems have contrasting advantages, a hybrid system, incorporating both, was also developed to leverage all the strong points. The submitted systems garnered BLEU scores of 8.701943312, 0.6361336198, and 11.78873307 respectively and the scores of the hybrid system helped us to the fourth spot in the competition leaderboard.

JUNLP at SemEval-2020 Task 9: Sentiment Analysis of Hindi-English Code Mixed Data Using Grid Search Cross Validation
Avishek Garain | Sainik Mahata | Dipankar Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Code-mixing is a phenomenon which arises mainly in multilingual societies. Multilingual people, who are well versed in their native languages and also English speakers, tend to code-mix using English-based phonetic typing and the insertion of anglicisms in their main language. This linguistic phenomenon poses a great challenge to conventional NLP domains such as Sentiment Analysis, Machine Translation, and Text Summarization, to name a few. In this work, we focus on working out a plausible solution to the domain of Code-Mixed Sentiment Analysis. This work was done as participation in the SemEval-2020 Sentimix Task, where we focused on the sentiment analysis of English-Hindi code-mixed sentences. Our username for the submission was “sainik.mahata” and our team name was “JUNLP”. We used feature extraction algorithms in conjunction with traditional machine learning algorithms such as SVR and Grid Search in an attempt to solve the task. Our approach garnered an F1-score of 66.2% when tested using metrics prepared by the organizers of the task.

2019

NLP at SemEval-2019 Task 6: Detecting Offensive language using Neural Networks
Prashant Kapil | Asif Ekbal | Dipankar Das
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we built several deep learning architectures to participate in the shared task OffensEval: Identifying and Categorizing Offensive Language in Social Media at SemEval-2019. The dataset was annotated with a three-level annotation scheme, and the tasks were to distinguish offensive from not offensive content, to categorize offensive content, and to identify the target of offensive content. Deep learning models with POS information as a feature were also leveraged for classification. The models that performed best on the individual subtasks are a stack of CNN-Bi-LSTM with Attention, a BiLSTM with POS information added to word features, and a Bi-LSTM for the third task. Our models achieved Macro F1 scores of 0.7594, 0.5378 and 0.4588 in Tasks A, B and C respectively, with ranks of 33rd, 54th and 52nd out of 103, 75 and 65 submissions. All three best-performing models are based on neural networks.

JUMT at WMT2019 News Translation Task: A Hybrid Approach to Machine Translation for Lithuanian to English
Sainik Kumar Mahata | Avishek Garain | Adityar Rayala | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In the current work, we present a description of the system submitted to WMT 2019 News Translation Shared task. The system was created to translate news text from Lithuanian to English. To accomplish the given task, our system used a Word Embedding based Neural Machine Translation model to post edit the outputs generated by a Statistical Machine Translation model. The current paper documents the architecture of our model, descriptions of the various modules and the results produced using the same. Our system garnered a BLEU score of 17.6.

Development of POS tagger for English-Bengali Code-Mixed data
Tathagata Raha | Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 16th International Conference on Natural Language Processing

Code-mixed texts are widespread nowadays due to the advent of social media. Since these texts combine two languages to formulate a sentence, they give rise to various research problems related to Natural Language Processing. In this paper, we explore one such problem, namely, Parts of Speech tagging of code-mixed texts. We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words are written in Roman script. Our approach initially involves the collection and cleaning of English-Bengali code-mixed tweets. These tweets were used as a development dataset for building our system. The proposed system is a modular approach that starts by tagging individual tokens with their respective languages and then passes them to different POS taggers, designed for different languages (English and Bengali, in our case). Tags given by the two systems are later joined together and the final result is then mapped to a universal POS tag set. Our system was checked using 100 manually POS tagged code-mixed sentences and it returned an accuracy of 75.29%.

2018

JUCBNMT at WMT2018 News Translation Task: Character Based Neural Machine Translation of Finnish to English
Sainik Kumar Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

In the current work, we present a description of the system submitted to WMT 2018 News Translation Shared task. The system was created to translate news text from Finnish to English. The system used a Character Based Neural Machine Translation model to accomplish the given task. The current paper documents the preprocessing steps, the description of the submitted system and the results produced using the same. Our system garnered a BLEU score of 12.9.

Summarization of Table Citations from Text
Monalisa Dey | Salma Mandi | Dipankar Das
Proceedings of the 15th International Conference on Natural Language Processing

A Content-based Recommendation System for Medical Concepts: Disease and Symptom
Anupam Mondal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 15th International Conference on Natural Language Processing

SMT vs NMT: A Comparison over Hindi and Bengali Simple Sentences
Sainik Kumar Mahata | Soumil Mandal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 15th International Conference on Natural Language Processing

WME 3.0: An Enhanced and Validated Lexicon of Medical Concepts
Anupam Mondal | Dipankar Das | Erik Cambria | Sivaji Bandyopadhyay
Proceedings of the 9th Global Wordnet Conference

Information extraction in the medical domain is laborious and time-consuming due to the insufficient number of domain-specific lexicons and lack of involvement of domain experts such as doctors and medical practitioners. Thus, in the present work, we are motivated to design a new lexicon, WME 3.0 (WordNet of Medical Events), which contains over 10,000 medical concepts along with their part of speech, gloss (descriptive explanations), polarity score, sentiment, similar sentiment words, category, affinity score and gravity score features. In addition, the manual annotators help to validate the overall as well as individual category level of medical concepts of WME 3.0 using Cohen’s Kappa agreement metric. The agreement score indicates almost correct identification of medical concepts and their assigned features in WME 3.0.

2017

Identification of Character Adjectives from Mahabharata
Apurba Paul | Dipankar Das
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

The present paper describes the identification of prominent characters and their adjectives from the Indian mythological epic, Mahabharata, written in English texts. However, in contrast to the traditional approaches of named entity identification, the present system extracts hidden attributes associated with each of the characters (e.g., character adjectives). We observed distinct phrase level linguistic patterns that hint at the presence of characters in different text spans. Six such patterns were used in order to extract the characters. On the other hand, a distinguishing set of novel features (e.g., multi-word expression, nodes and paths of parse tree, immediate ancestors etc.) was employed. Further, the correlation of the features is also measured in order to identify the important features. Finally, we applied various machine learning algorithms (e.g., Naive Bayes, KNN, Logistic Regression, Decision Tree, Random Forest etc.) along with deep learning to classify the patterns as characters or non-characters in order to achieve decent accuracy. Evaluation shows that phrase level linguistic patterns as well as the adopted features are highly active in capturing characters and their adjectives.

JUNLP at IJCNLP-2017 Task 3: A Rank Prediction Model for Review Opinion Diversification
Monalisa Dey | Anupam Mondal | Dipankar Das
Proceedings of the IJCNLP 2017, Shared Tasks

The IJCNLP-17 Review Opinion Diversification (RevOpiD-2017) task has been designed for ranking the top-k reviews of a product from a set of reviews, which assists in identifying a summarized output to express the opinion of the entire review set. The task is divided into three independent subtasks: subtask-A, subtask-B, and subtask-C. These three subtasks select the top-k reviews based on helpfulness, representativeness, and exhaustiveness of the opinions expressed in the review set, respectively. In order to develop the modules and predict the rank of reviews for all three subtasks, we have employed two well-known supervised classifiers, namely Naïve Bayes and Logistic Regression, on top of several features extracted from the provided datasets, such as the number of nouns, number of verbs, and number of sentiment words. Finally, the organizers have helped to validate the predicted outputs for all three subtasks by using their evaluation metrics. The metrics provide the scores of list size 5 as (0.80 (mth)) for subtask-A, (0.86 (cos), 0.87 (cos d), 0.71 (cpr), 4.98 (a-dcg), and 556.94 (wt)) for subtask-B, and (10.94 (unwt) and 0.67 (recall)) for subtask-C.

NITMZ-JU at IJCNLP-2017 Task 4: Customer Feedback Analysis
Somnath Banerjee | Partha Pakray | Riyanka Manna | Dipankar Das | Alexander Gelbukh
Proceedings of the IJCNLP 2017, Shared Tasks

In this paper, we describe a deep learning framework for analyzing the customer feedback as part of our participation in the shared task on Customer Feedback Analysis at the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). A Convolutional Neural Network (CNN) based deep neural network model was employed for the customer feedback task. The proposed system was evaluated on two languages, namely, English and French.

JU NITM at IJCNLP-2017 Task 5: A Classification Approach for Answer Selection in Multi-choice Question Answering System
Sandip Sarkar | Dipankar Das | Partha Pakray
Proceedings of the IJCNLP 2017, Shared Tasks

This paper describes the participation of the JU NITM team in IJCNLP-2017 Task 5: “Multi-choice Question Answering in Examinations”. The main aim of this shared task is to choose the correct option for each multi-choice question. Our proposed model uses vector representations as features and machine learning for classification. First, we represent the question and the answer in vector space and then compute the cosine similarity between the two vectors. Finally, we apply a classification approach to find the correct answer. Our system was developed only for the English language, and it obtained an accuracy of 40.07% on the test dataset and 40.06% on the validation dataset.

JU CSE NLP @ SemEval 2017 Task 7: Employing Rules to Detect and Interpret English Puns
Aniket Pramanick | Dipankar Das
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

System description. Implementation of HMM and Cyclic Dependency Network.

BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

A Statistical Machine Translation (SMT) system is always trained using a large parallel corpus to produce effective translation. Not only is such a corpus scarce, it also involves a lot of manual labor and cost. A parallel corpus can be prepared by employing comparable corpora, where a pair of corpora is in two different languages pointing to the same domain. In the present work, we try to build a parallel corpus for the French-English language pair from a given comparable corpus. The data and the problem set are provided as part of the shared task organized by BUCC 2017. We have proposed a system that first translates the sentences by relying heavily on Moses and then groups the sentences based on sentence length similarity. Finally, the one-to-one sentence selection was done based on the Cosine Similarity algorithm.

Relationship Extraction based on Category of Medical Concepts from Lexical Contexts
Anupam Mondal | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Retrieving Similar Lyrics for Music Recommendation System
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Developing Lexicon and Classifier for Personality Identification in Texts
Kumar Gourav Das | Dipankar Das
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

A Deep Dive into Identification of Characters from Mahabharata
Apurba Paul | Dipankar Das
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

JU_NLP at SemEval-2016 Task 6: Detecting Stance in Tweets using Support Vector Machines
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

JUNITMZ at SemEval-2016 Task 1: Identifying Semantic Similarity Using Levenshtein Ratio
Sandip Sarkar | Dipankar Das | Partha Pakray | Alexander Gelbukh
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

JU_NLP at SemEval-2016 Task 11: Identifying Complex Words in a Sentence
Niloy Mukherjee | Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

JUNLP at SemEval-2016 Task 13: A Language Independent Approach for Hypernym Identification
Promita Maitra | Dipankar Das
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

WMT2016: A Hybrid Approach to Bilingual Document Alignment
Sainik Mahata | Dipankar Das | Santanu Pal
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Unraveling the English-Bengali Code-Mixing Phenomenon
Arunavha Chanda | Dipankar Das | Chandan Mazumdar
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Part-of-speech Tagging of Code-Mixed Social Media Text
Souvick Ghosh | Satanu Ghosh | Dipankar Das
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Columbia-Jadavpur submission for EMNLP 2016 Code-Switching Workshop Shared Task: System description
Arunavha Chanda | Dipankar Das | Chandan Mazumdar
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Multimodal Mood Classification - A Case Study of Differences in Hindi and Western Songs
Braja Gopal Patra | Dipankar Das | Sivaji Bandyopadhyay
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Music information retrieval has emerged as a mainstream research area in the past two decades. Experiments on music mood classification have been performed mainly on Western music based on audio, lyrics and a combination of both. Unfortunately, due to the scarcity of digitalized resources, Indian music fares poorly in music mood retrieval research. In this paper, we identified the mood taxonomy and prepared multimodal mood annotated datasets for Hindi and Western songs. We identified important audio and lyric features using a correlation based feature selection technique. Finally, we developed mood classification systems using Support Vector Machines and Feed Forward Neural Networks based on the features collected from audio, lyrics, and a combination of both. The best performing multimodal systems achieved F-measures of 75.1 and 83.5 for classifying the moods of the Hindi and Western songs respectively using Feed Forward Neural Networks. A comparative analysis indicates that the selected features work well for mood classification of the Western songs and produce better results than the mood classification systems for Hindi songs.

WME: Sense, Polarity and Affinity based Concept Resource for Medical Events
Anupam Mondal | Dipankar Das | Erik Cambria | Sivaji Bandyopadhyay
Proceedings of the 8th Global WordNet Conference (GWC)

In order to overcome the lack of medical corpora, we have developed a WordNet for Medical Events (WME) for identifying medical terms and their sense related information using a seed list. The initial WME resource contains 1654 medical terms or concepts. In the present research, we have reported the enhancement of WME with 6415 medical concepts along with their conceptual features viz. Parts-of-Speech (POS), gloss, semantics, polarity, sense and affinity. Several polarity lexicons viz. SentiWordNet, SenticNet, Bing Liu’s subjectivity list and Taboada’s adjective list were introduced with WordNet synonyms and hyponyms for expansion. The semantics feature guided us to build a semantic co-reference relation based network between the related medical concepts. These features help to prepare a medical concept network for better sense relation based visualization. Finally, we evaluated the resource with respect to the Adaptive Lesk Algorithm and conducted an agreement analysis for validating the expanded WME resource.
