Radhika Mamidi


2023

Witcherses at SemEval-2023 Task 12: Ensemble Learning for African Sentiment Analysis
Monil Gokani | K V Aditya Srivatsa | Radhika Mamidi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes our system submission for SemEval-2023 Task 12 AfriSenti-SemEval: Sentiment Analysis for African Languages. We propose an XGBoost-based ensemble model trained on emoticon frequency-based features and the predictions of several statistical models such as SVMs, Logistic Regression, Random Forests, and BERT-based pre-trained language models such as AfriBERTa and AfroXLMR. We also report results from additional experiments not in the system. Our system achieves a mixed bag of results, achieving a best rank of 7th in three of the languages - Igbo, Twi, and Yoruba.

Matt Bai at SemEval-2023 Task 5: Clickbait spoiler classification via BERT
Nukit Tailor | Radhika Mamidi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Clickbait Spoiling shared task aims at tackling two aspects of spoiling: classifying the spoiler type based on its length and generating the spoiler. This paper focuses on the task of classifying the spoiler type. Better classification of the spoiler type would eventually help in generating a better spoiler for the post. We use BERT-base (cased) to classify the clickbait posts. The model achieves a balanced accuracy of 0.63 when given only the post content as input, rather than the concatenation of the post title and post content, in order to find out what difference the post title might make.
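The balanced accuracy reported above can be illustrated with a minimal sketch. Balanced accuracy is the mean of per-class recall, so a rare spoiler type counts as much as a frequent one. The label names and predictions below are made-up illustrations, not data from the paper.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class contributes equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical gold labels and predictions (illustrative only).
y_true = ["phrase", "phrase", "phrase", "phrase", "passage", "passage", "multi", "multi"]
y_pred = ["phrase", "phrase", "phrase", "passage", "passage", "multi", "multi", "multi"]

# Per-class recall: phrase 3/4, passage 1/2, multi 2/2.
score = balanced_accuracy(y_true, y_pred)
```

Plain accuracy on the same toy data would also be 0.75 here, but the two measures diverge as soon as the class sizes become more skewed.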

PanwarJayant at SemEval-2023 Task 10: Exploring the Effectiveness of Conventional Machine Learning Techniques for Online Sexism Detection
Jayant Panwar | Radhika Mamidi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The rapid growth of online communication using social media platforms has led to an increase in the presence of hate speech, especially in terms of sexist language online. The proliferation of such hate speech has a significant impact on the mental health and well-being of the users, hence the need for automated systems to detect and filter such texts. In this study, we explore the effectiveness of conventional machine learning techniques for detecting sexist text. We explore five conventional classifiers, namely, Logistic Regression, Decision Tree, XGBoost, Support Vector Machines, and Random Forest. The results show that different classifiers perform differently on each task due to their different inherent architectures, which may be better suited to a certain problem. These models are trained on the shared task dataset, which includes both sexist and non-sexist texts. All in all, this study explores the potential of conventional machine learning techniques in detecting online sexist content. The results of this study highlight the strengths and weaknesses of all classifiers with respect to all subtasks. The results of this study will be useful for researchers and practitioners interested in developing systems for detecting or filtering online hate speech.

Billy-Batson at SemEval-2023 Task 5: An Information Condensation based System for Clickbait Spoiling
Anubhav Sharma | Sagar Joshi | Tushar Abhishek | Radhika Mamidi | Vasudeva Varma
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The Clickbait Challenge targets spoiling the clickbaits using short pieces of information known as spoilers to satisfy the curiosity induced by a clickbait post. The large context of the article associated with the clickbait and differences in the spoiler forms make the task challenging. Hence, to tackle the large context, we propose an Information Condensation-based approach, which prunes down the unnecessary context. Given an article, our filtering module optimised with a contrastive learning objective first selects the paragraphs that are the most relevant to the corresponding clickbait. The resulting condensed article is then fed to the two downstream tasks of spoiler type classification and spoiler generation. We demonstrate and analyze the gains from this approach on both the tasks. Overall, we win the task of spoiler type classification and achieve competitive results on spoiler generation.

2022

English To Indian Sign Language:Rule-Based Translation System Along With Multi-Word Expressions and Synonym Substitution
Abhigyan Ghosh | Radhika Mamidi
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Hearing-challenged communities all over the world face difficulties communicating with others. Machine translation has been one of the prominent technologies to facilitate communication with the deaf and hard of hearing community worldwide. We have explored and formulated the fundamental rules of Indian Sign Language (ISL) and implemented them as a translation mechanism of English text to Indian Sign Language glosses. According to the formulated rules and sub-rules, the source text structure is identified and transferred to the target ISL gloss. This target language is such that it can be easily converted to videos using the Indian Sign Language dictionary. This research work also mentions the intermediate phases of the transfer process and innovations in the process such as Multi-Word Expression detection and synonym substitution to handle the limited vocabulary size of Indian Sign Language while producing semantically accurate translations.

LastResort at SemEval-2022 Task 4: Towards Patronizing and Condescending Language Detection using Pre-trained Transformer Based Models Ensembles
Samyak Agrawal | Radhika Mamidi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents our systems for Task 4 at SemEval-2022: Patronizing and Condescending Language Detection. This shared task contains two sub-tasks. The first sub-task is a binary classification task whose goal is to predict whether a given paragraph contains any form of patronising or condescending language (PCL). For the second sub-task, given a paragraph, we have to find which PCL categories express the condescension. Here we have a total of 7 overlapping sub-categories for PCL. Our proposed solution uses BERT-based ensembled models with hard voting and techniques applied to take care of class imbalances. Our paper describes the system architecture of the submitted solution and other experiments that we conducted.

LastResort at SemEval-2022 Task 5: Towards Misogyny Identification using Visual Linguistic Model Ensembles And Task-Specific Pretraining
Samyak Agrawal | Radhika Mamidi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In current times, memes have become one of the most popular mediums to share jokes and information with the masses over the internet. Memes can also be used as tools to spread hatred and target women through degrading content disguised as humour. The task, Multimedia Automatic Misogyny Identification (MAMI), is to detect misogyny in these memes. This task is further divided into two sub-tasks: (A) Misogynous meme identification, where a meme should be categorized either as misogynous or not misogynous and (B) Categorizing these misogynous memes into potential overlapping subcategories. In this paper, we propose models leveraging task-specific pretraining with transfer learning on Visual Linguistic models. Our best performing models scored 0.686 and 0.691 on sub-tasks A and B respectively.

CMNEROne at SemEval-2022 Task 11: Code-Mixed Named Entity Recognition by leveraging multilingual data
Suman Dowlagar | Radhika Mamidi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Identifying named entities is, in general, a practical and challenging task in the field of Natural Language Processing. Named Entity Recognition on the code-mixed text is further challenging due to the linguistic complexity resulting from the nature of the mixing. This paper addresses the submission of team CMNEROne to the SEMEVAL 2022 shared task 11 MultiCoNER. The Code-mixed NER task aimed to identify named entities on the code-mixed dataset. Our work consists of Named Entity Recognition (NER) on the code-mixed dataset by leveraging the multilingual data. We achieved a weighted average F1 score of 0.7044, i.e., 6% greater than the NER baseline.

Sammaan@LT-EDI-ACL2022: Ensembled Transformers Against Homophobia and Transphobia
Ishan Sanjeev Upadhyay | Kv Aditya Srivatsa | Radhika Mamidi
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Hateful and offensive content on social media platforms can have negative effects on users and can make online communities more hostile towards certain people and hamper equality, diversity and inclusion. In this paper, we describe our approach to classify homophobia and transphobia in social media comments. We used an ensemble of transformer-based models to build our classifier. Our model ranked 2nd for English, 8th for Tamil and 10th for Tamil-English.

DepressionOne@LT-EDI-ACL2022: Using Machine Learning with SMOTE and Random UnderSampling to Detect Signs of Depression on Social Media Text.
Suman Dowlagar | Radhika Mamidi
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Depression is a common and serious medical illness that negatively affects how you feel, the way you think, and how you act. Detecting depression is essential as it must be treated early to avoid painful consequences. Nowadays, people are broadcasting how they feel via posts and comments. Using social media, we can extract many comments related to depression and use NLP techniques to train and detect depression. This work presents the submission of the DepressionOne team at LT-EDI-2022 for the shared task, detecting signs of depression from social media text. The depression data is small and unbalanced. Thus, we have used oversampling and undersampling methods such as SMOTE and RandomUnderSampler to represent the data. Later, we used machine learning methods to train and detect the signs of depression.
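The two resampling ideas named above can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: random undersampling keeps a random subset of the majority class, and the SMOTE-style step creates synthetic minority points by interpolating between a sample and its single nearest neighbour (the full SMOTE algorithm picks among k nearest neighbours). The feature vectors and function names are hypothetical.

```python
import random

def random_undersample(samples, target_size, rng):
    """Keep a random subset of the (majority) class."""
    return rng.sample(samples, target_size)

def smote_like_oversample(samples, target_size, rng):
    """Grow the (minority) class with points interpolated towards
    each chosen sample's nearest neighbour (k = 1 simplification)."""
    synthetic = []
    while len(samples) + len(synthetic) < target_size:
        i = rng.randrange(len(samples))
        x = samples[i]
        others = [s for j, s in enumerate(samples) if j != i]
        neighbour = min(
            others,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s)),
        )
        u = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return samples + synthetic

rng = random.Random(0)
# Toy 2-D feature vectors (illustrative only).
majority = [(float(i), 0.0) for i in range(10)]
minority = [(0.0, 5.0), (1.0, 6.0), (2.0, 5.5)]

balanced_major = random_undersample(majority, 6, rng)
balanced_minor = smote_like_oversample(minority, 6, rng)
```

After both steps each class has six examples, and every synthetic minority point lies on a line segment between two real minority points.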

Towards Detecting Political Bias in Hindi News Articles
Samyak Agrawal | Kshitij Gupta | Devansh Gautam | Radhika Mamidi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Political propaganda in recent times has been amplified by media news portals through biased reporting, creating untruthful narratives on serious issues and causing misinformed public opinions in the interest of siding with and helping a particular political party. This issue poses a challenging NLP task of detecting political bias in news articles. We propose a transformer-based transfer learning method to fine-tune the pre-trained network on our data for this bias detection. As the required dataset for this particular task was not available, we created our dataset comprising 1388 Hindi news articles and their headlines from various Hindi news media outlets. We marked them on whether they are biased towards, against, or neutral to BJP, a political party, and the current ruling party at the centre in India.

TeluguNER: Leveraging Multi-Domain Named Entity Recognition with Deep Transformers
Suma Reddy Duggenpudi | Subba Reddy Oota | Mounika Marreddy | Radhika Mamidi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Named Entity Recognition (NER) is a successful and well-researched problem in English due to the availability of resources. The transformer models, specifically the masked-language models (MLM), have shown remarkable performance in NER during recent times. With growing data in different online platforms, there is a need for NER in other languages too. NER remains underexplored in Indian languages due to the lack of resources and tools. Our contributions in this paper include (i) two annotated NER datasets for the Telugu language in multiple domains: Newswire Dataset (ND) and Medical Dataset (MD), and we combined ND and MD to form Combined Dataset (CD); (ii) comparison of the finetuned Telugu pretrained transformer models (BERT-Te, RoBERTa-Te, and ELECTRA-Te) with other baseline models (CRF, LSTM-CRF, and BiLSTM-CRF); (iii) further investigation of the performance of Telugu pretrained transformer models against the multilingual models mBERT, XLM-R, and IndicBERT. We find that pretrained Telugu language models (BERT-Te and RoBERTa) outperform the existing pretrained multilingual and baseline models in NER. On a large dataset (CD) of 38,363 sentences, the BERT-Te achieves a high F1-score of 0.80 (entity-level) and 0.75 (token-level). Further, these pretrained Telugu models have shown state-of-the-art performance on various existing Telugu NER datasets. We open-source our dataset, pretrained models, and code.

Towards Toxic Positivity Detection
Ishan Sanjeev Upadhyay | KV Aditya Srivatsa | Radhika Mamidi
Proceedings of the Tenth International Workshop on Natural Language Processing for Social Media

Over the past few years, there has been a growing concern around toxic positivity on social media which is a phenomenon where positivity is used to minimize one’s emotional experience. In this paper, we create a dataset for toxic positivity classification from Twitter and an inspirational quote website. We then perform benchmarking experiments using various text classification models and show the suitability of these models for the task. We achieved a macro F1 score of 0.71 and a weighted F1 score of 0.85 by using an ensemble model. To the best of our knowledge, our dataset is the first such dataset created.
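The gap between the macro and weighted F1 scores reported above is typical of imbalanced data. A minimal sketch of the two averages follows: macro F1 averages per-class F1 equally, while weighted F1 weights each class by its number of gold examples. The labels below are toy values, not the paper's data.

```python
def f1_per_class(y_true, y_pred):
    """Return per-class F1 and per-class support (gold counts)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores, support = {}, {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
        support[c] = sum(t == c for t in y_true)
    return scores, support

def macro_f1(y_true, y_pred):
    scores, _ = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    scores, support = f1_per_class(y_true, y_pred)
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total

# Toy imbalanced data: 8 examples of class 0, 2 of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1] + [1, 0]

macro = macro_f1(y_true, y_pred)        # classes count equally
weighted = weighted_f1(y_true, y_pred)  # classes count by support
```

On this toy data the minority class has the lower F1, so the macro average comes out below the weighted average, mirroring the direction of the gap in the reported scores.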

2021

Developing Conversational Data and Detection of Conversational Humor in Telugu
Vaishnavi Pamulapati | Radhika Mamidi
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

In the field of humor research, there has been a recent surge of interest in the sub-domain of Conversational Humor (CH). This study has two main objectives: (a) to develop a conversational (humorous and non-humorous) dataset in Telugu, and (b) to detect CH in the compiled dataset. In this paper, the challenges faced while collecting the data and the experiments carried out are elucidated. Transfer learning and non-transfer learning techniques are implemented by utilizing pre-trained models such as FastText word embeddings, BERT language models and Text GCN, which learns the word and document embeddings of the given corpus simultaneously. State-of-the-art results are observed with a 99.3% accuracy and a 98.5% F1 score achieved by BERT.

Graph Convolutional Networks with Multi-headed Attention for Code-Mixed Sentiment Analysis
Suman Dowlagar | Radhika Mamidi
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Code-mixing is a frequently observed phenomenon in multilingual communities where a speaker uses multiple languages in an utterance or sentence. Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools as they are typically trained on monolingual corpora. Recently, finding the sentiment from code-mixed text has been attempted by some researchers in SentiMix SemEval 2020 and Dravidian-CodeMix FIRE 2020 shared tasks. Mostly, the attempts include traditional methods, long short term memory, convolutional neural networks, and transformer models for code-mixed sentiment analysis (CMSA). However, no study has explored graph convolutional neural networks on CMSA. In this paper, we propose the graph convolutional networks (GCN) for sentiment analysis on code-mixed text. We have used the datasets from the Dravidian-CodeMix FIRE 2020. Our experimental results on multiple CMSA datasets demonstrate that the GCN with multi-headed attention model has shown an improvement in classification metrics.

OFFLangOne@DravidianLangTech-EACL2021: Transformers with the Class Balanced Loss for Offensive Language Identification in Dravidian Code-Mixed text.
Suman Dowlagar | Radhika Mamidi
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

The intensity of online abuse has increased in recent years. Automated tools are being developed to prevent the use of hate speech and offensive content. Most of the technologies use natural language and machine learning tools to identify offensive text. In a multilingual society, where code-mixing is the norm, hate content is delivered in a code-mixed form on social media, which makes offensive content identification even more challenging. In this work, we participated in the EACL task to detect offensive content in the code-mixed social media scenario. The methodology uses a transformer model with transliteration and class balancing loss for offensive content identification. In this task, our model has been ranked 2nd in Malayalam-English and 4th in Tamil-English code-mixed languages.

Jibes & Delights: A Dataset of Targeted Insults and Compliments to Tackle Online Abuse
Ravsimar Sodhi | Kartikey Pant | Radhika Mamidi
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

Online abuse and offensive language on social media have become widespread problems in today’s digital age. In this paper, we contribute a Reddit-based dataset, consisting of 68,159 insults and 51,102 compliments targeted at individuals instead of targeting a particular community or race. Secondly, we benchmark multiple existing state-of-the-art models for both classification and unsupervised style transfer on the dataset. Finally, we analyse the experimental results and conclude that the transfer task is challenging, requiring the models to understand the high degree of creativity exhibited in the data.

Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble
Kshitij Gupta | Devansh Gautam | Radhika Mamidi
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Memes are one of the most popular types of content used to spread information online. They can influence a large number of people through rhetorical and psychological techniques. The task, Detection of Persuasion Techniques in Texts and Images, is to detect these persuasive techniques in memes. It consists of three subtasks: (A) Multi-label classification using textual content, (B) Multi-label classification and span identification using textual content, and (C) Multi-label classification using visual and textual content. In this paper, we propose a transfer learning approach to fine-tune BERT-based models in different modalities. We also explore the effectiveness of ensembles of models trained in different modalities. We achieve an F1-score of 57.0, 48.2, and 52.1 in the corresponding subtasks.

IIITH at SemEval-2021 Task 7: Leveraging transformer-based humourous and offensive text detection architectures using lexical and hurtlex features and task adaptive pretraining
Tathagata Raha | Ishan Sanjeev Upadhyay | Radhika Mamidi | Vasudeva Varma
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes our approach (IIITH) for SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. Our results focus on two major objectives: (i) the effect of task adaptive pretraining on the performance of transformer-based models, and (ii) how lexical and hurtlex features help in quantifying humour and offense. In this paper, we provide a detailed description of our approach along with the comparisons mentioned above.

ViTA: Visual-Linguistic Translation by Aligning Object Tags
Kshitij Gupta | Devansh Gautam | Radhika Mamidi
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

Multimodal Machine Translation (MMT) enriches the source text with visual information for translation. It has gained popularity in recent years, and several pipelines have been proposed in the same direction. Yet, the task lacks quality datasets to illustrate the contribution of visual modality in the translation systems. In this paper, we propose our system under the team name Volta for the Multimodal Translation Task of WAT 2021 from English to Hindi. We also participate in the textual-only subtask of the same language pair for which we use mBART, a pretrained multilingual sequence-to-sequence model. For multimodal translation, we propose to enhance the textual input by bringing the visual information to a textual domain by extracting object tags from the image. We also explore the robustness of our system by systematically degrading the source text. Finally, we achieve a BLEU score of 44.6 and 51.6 on the test set and challenge set of the multimodal task.

EDIOne@LT-EDI-EACL2021: Pre-trained Transformers with Convolutional Neural Networks for Hope Speech Detection.
Suman Dowlagar | Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

Hope is an essential aspect of mental health stability and recovery in every individual in this fast-changing world. Any tools and methods developed for detection, analysis, and generation of hope speech will be beneficial. In this paper, we propose a model on hope-speech detection to automatically detect web content that may play a positive role in diffusing hostility on social media. We perform the experiments by taking advantage of pre-processing and transfer-learning models. We observed that the pre-trained multilingual-BERT model with convolution neural networks gave the best results. Our model ranked first, third, and fourth on the English, Malayalam-English, and Tamil-English code-mixed datasets, respectively.

Autobots@LT-EDI-EACL2021: One World, One Family: Hope Speech Detection with BERT Transformer Model
Sunil Gundapu | Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

The rapid rise of online social networks like YouTube, Facebook, Twitter allows people to express their views more widely online. However, at the same time, it can lead to an increase in conflict and hatred among consumers in the form of freedom of speech. Therefore, it is essential to take a positive strengthening method to research on encouraging, positive, helping, and supportive social media content. In this paper, we describe a Transformer-based BERT model for Hope speech detection for equality, diversity, and inclusion, submitted for LT-EDI-2021 Task 2. Our model achieves a weighted averaged f1-score of 0.93 on the test set.

Hopeful Men@LT-EDI-EACL2021: Hope Speech Detection Using Indic Transliteration and Transformers
Ishan Sanjeev Upadhyay | Nikhil E | Anshul Wadhawan | Radhika Mamidi
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

This paper aims to describe the approach we used to detect hope speech in the HopeEDI dataset. We experimented with two approaches. In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM based models. The second approach involved using a majority voting ensemble of 11 models which were obtained by fine-tuning pre-trained transformer models (BERT, ALBERT, RoBERTa, IndicBERT) after adding an output layer. We found that the second approach was superior for English, Tamil and Malayalam. Our solution got a weighted F1 score of 0.93, 0.75 and 0.49 for English, Malayalam and Tamil respectively. Our solution ranked 1st in English, 8th in Malayalam and 11th in Tamil.
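The majority voting ensemble described above can be sketched as follows. Each fine-tuned model contributes one predicted label per example, and the ensemble outputs the most frequent label. The label names and the three toy models are hypothetical; the paper's ensemble uses 11 models.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, all of equal length.
    Returns the most common label for each example; on a tie, the
    label that appears first among the models' votes wins."""
    ensemble = []
    for votes in zip(*predictions):
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble

# Hypothetical outputs of three models on three comments.
model_a = ["hope", "not_hope", "not_hope"]
model_b = ["hope", "hope", "not_hope"]
model_c = ["not_hope", "hope", "not_hope"]

final = majority_vote([model_a, model_b, model_c])
```

Using an odd number of models, as with the 11 in the paper, avoids ties in the two-class case.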

How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages
Sourav Kumar | Salil Aggarwal | Dipti Misra Sharma | Radhika Mamidi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time. Nowadays, researchers are constantly putting effort into utilizing language relatedness to improve the performance of various NLP systems such as cross-lingual semantic search, machine translation, sentiment analysis systems, etc. So in this paper, we performed an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on the approach to calculate lexical similarity between Indian languages by looking at various factors such as size and type of corpus, similarity algorithms, subword segmentation, etc. The main takeaways from our work are: (i) the relative order of the language similarities largely remains the same, regardless of the factors mentioned above, (ii) similarity within the same language family is higher, (iii) languages share more lexical features at the subword level.

Analyzing Curriculum Learning for Sentiment Analysis along Task Difficulty, Pacing and Visualization Axes
Anvesh Rao Vijjini | Kaveri Anuranjana | Radhika Mamidi
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

While Curriculum Learning (CL) has recently gained traction in Natural Language Processing tasks, it is still not adequately analyzed. Previous works only show its effectiveness but fall short of fully explaining and interpreting its internal workings. In this paper, we analyze curriculum learning in sentiment analysis along multiple axes. Some of these axes have been proposed by earlier works that need more in-depth study. Such analysis requires understanding where curriculum learning works and where it does not. Our axes of analysis include Task difficulty on CL, comparing CL pacing techniques, and qualitative analysis by visualizing the movement of attention scores in the model as curriculum phases progress. We find that curriculum learning works best for difficult tasks and may even lead to a decrement in performance for tasks with higher performance without curriculum learning. We see that One-Pass curriculum strategies suffer from catastrophic forgetting and attention movement visualization within curriculum pacing. This shows that curriculum learning breaks down the challenging main task into easier sub-tasks solved sequentially.

Efficient Multilingual Text Classification for Indian Languages
Salil Aggarwal | Sourav Kumar | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

India is one of the richest language hubs on the earth and is very diverse and multilingual. But apart from a few Indian languages, most of them are still considered to be resource poor. Since most of the NLP techniques either require linguistic knowledge that can only be developed by experts and native speakers of that language or they require a lot of labelled data which is again expensive to generate, the task of text classification becomes challenging for most of the Indian languages. The main objective of this paper is to see how one can benefit from the lexical similarity found in Indian languages in a multilingual scenario. Can a classification model trained on one Indian language be reused for other Indian languages? So, we performed zero-shot text classification via exploiting lexical similarity and we observed that our model performs best in those cases where the vocabulary overlap between the language datasets is maximum. Our experiments also confirm that a single multilingual model trained via exploiting language relatedness outperforms the baselines by significant margins.
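The vocabulary overlap mentioned above can be quantified in several ways. The sketch below uses Jaccard overlap between whitespace-tokenized vocabularies, which is an assumption made for illustration. The abstract does not state which measure the authors used, and the romanized toy sentences are invented.

```python
def vocabulary(sentences):
    """Set of whitespace-separated tokens in a list of sentences."""
    return {token for sentence in sentences for token in sentence.split()}

def jaccard_overlap(sentences_a, sentences_b):
    """Shared vocabulary divided by combined vocabulary."""
    vocab_a, vocab_b = vocabulary(sentences_a), vocabulary(sentences_b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0

# Invented, romanized toy corpora for two related languages.
corpus_a = ["pani ghar kitab", "raat din"]
corpus_b = ["pani ghar pustak", "ratra din"]

# 3 shared tokens (pani, ghar, din) out of 7 distinct tokens overall.
overlap = jaccard_overlap(corpus_a, corpus_b)
```

A score like this could be computed for each language pair to rank candidate source languages for zero-shot transfer, with higher overlap suggesting a better match.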

TEASER: Towards Efficient Aspect-based SEntiment Analysis and Recognition
Vaibhav Bajaj | Kartikey Pant | Ishan Upadhyay | Srinath Nair | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Sentiment analysis aims to detect the overall sentiment, i.e., the polarity of a sentence, paragraph, or text span, without considering the entities mentioned and their aspects. Aspect-based sentiment analysis aims to extract the aspects of the given target entities and their respective sentiments. Prior works formulate this as a sequence tagging problem or solve this task using a span-based extract-then-classify framework where first all the opinion targets are extracted from the sentence, and then with the help of span representations, the targets are classified as positive, negative, or neutral. The sequence tagging formulation suffers from issues like sentiment inconsistency and colossal search space, whereas the span-based extract-then-classify framework suffers from issues such as half-word coverage and overlapping spans. To overcome this, we propose a similar span-based extract-then-classify framework with a novel and improved heuristic. Experiments on the three benchmark datasets (Restaurant14, Laptop14, Restaurant15) show our model consistently outperforms the current state-of-the-art. Moreover, we also present a novel supervised movie reviews dataset (Movie20) and a pseudo-labeled movie reviews dataset (moviesLarge) made explicitly for this task and report the results on the novel Movie20 dataset as well.

A Pre-trained Transformer and CNN Model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text
Suman Dowlagar | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Code-mixing (CM) is a frequently observed phenomenon that uses multiple languages in an utterance or sentence. There are no strict grammatical constraints observed in code-mixing, and it consists of non-standard variations of spelling. The linguistic complexity resulting from the above factors made the computational analysis of the code-mixed language a challenging task. Language identification (LI) and part of speech (POS) tagging are the fundamental steps that help analyze the structure of the code-mixed text. Often, the LI and POS tagging tasks are interdependent in the code-mixing scenario. We project the problem of dealing with multilingualism and grammatical structure while analyzing the code-mixed sentence as a joint learning task. In this paper, we jointly train and optimize language detection and part of speech tagging models in the code-mixed scenario. We used a Transformer with convolutional neural network architecture. We train a joint learning method by combining POS tagging and LI models on code-mixed social media text obtained from the ICON shared task.

Towards Quantifying Magnitude of Political Bias in News Articles Using a Novel Annotation Schema
Lalitha Kameswari | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Media bias is a predominant phenomenon present in most forms of print and electronic media such as news articles, blogs, tweets, etc. Since media plays a pivotal role in shaping public opinion towards political happenings, both political parties and media houses often use such sources as outlets to propagate their own prejudices to the public. There has been some research on detecting political bias in news articles. However, none of it attempts to analyse the nature of bias or quantify the magnitude of the bias in a given text. This paper presents a political bias annotated corpus viz. PoBiCo-21, which is annotated using a schema specifically designed with 10 labels to capture various techniques used to create political bias in news. We create a ranking of these techniques based on their contribution to bias. After validating the ranking, we propose methods to use it to quantify the magnitude of bias in political news articles.

Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text
Siva Subrahamanyam Varma Kusampudi | Anudeep Chaluvadi | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Code-Mixing (CM) is a common phenomenon in multilingual societies. CM plays a significant role in technology and medical fields where terminologies in the native language are not available or known. Language Identification (LID) of the CM data will help solve NLP tasks such as Spell Checking, Named Entity Recognition, Part-Of-Speech tagging, and Semantic Parsing. In the current era of machine learning, a common problem for the above-mentioned tasks is the availability of learning data to train models. In this paper, we introduce two manually annotated Telugu-English CM datasets (Twitter dataset and Blog dataset). The Twitter dataset contains more romanization variability and misspelled words than the Blog dataset. We compare various classification models and perform extensive benchmarking of both Classical and Deep Learning Models for LID against existing models. We propose two architectures for language classification (Telugu and English) in CM data: (1) Word Level Classification and (2) Sentence Level word-by-word Classification, and compare these approaches, presenting two strong baselines for LID on these datasets.

Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization
Siva Subrahamanyam Varma Kusampudi | Preetham Sathineni | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In a multilingual society, people communicate in more than one language, leading to Code-Mixed data. Sentiment analysis on Code-Mixed Telugu-English Text (CMTET) poses unique challenges. The unstructured nature of the Code-Mixed Data is due to the informal language, informal transliterations, and spelling errors. In this paper, we introduce an annotated dataset for Sentiment Analysis in CMTET. Also, we report an accuracy of 80.22% on this dataset using novel unsupervised data normalization with a Multilayer Perceptron (MLP) model. This proposed data normalization technique can be extended to any NLP task involving CMTET. Further, we report an increase of 2.53% accuracy due to this data normalization approach in our best model.

Towards Sentiment Analysis of Tobacco Products’ Usage in Social Media
Venkata Himakar Yanamandra | Kartikey Pant | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Contemporary tobacco-related studies are mostly concerned with a single social media platform while missing out on a broader audience. Moreover, they are heavily reliant on labeled datasets, which are expensive to make. In this work, we explore sentiment and product identification on tobacco-related text from two social media platforms. We release SentiSmoke-Twitter and SentiSmoke-Reddit datasets, along with a comprehensive annotation schema for identifying tobacco products’ sentiment. We then perform benchmarking text classification experiments using state-of-the-art models, including BERT, RoBERTa, and DistilBERT. Our experiments show F1 scores as high as 0.72 for sentiment identification in the Twitter dataset, 0.46 for sentiment identification, and 0.57 for product identification using semi-supervised learning for Reddit.

Political Discourse Analysis: A Case Study of Code Mixing and Code Switching in Political Speeches
Dama Sravani | Lalitha Kameswari | Radhika Mamidi
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Political discourse is one of the most interesting data to study power relations in the framework of Critical Discourse Analysis. With the increase in the modes of textual and spoken forms of communication, politicians use language and linguistic mechanisms that contribute significantly in building their relationship with people, especially in a multilingual country like India with many political parties with different ideologies. This paper analyses code-mixing and code-switching in Telugu political speeches to determine the factors responsible for their usage levels in various social settings and communicative contexts. We also compile a detailed set of rules capturing dialectal variations between Standard and Telangana dialects of Telugu.

Gated Convolutional Sequence to Sequence Based Learning for English-Hingilsh Code-Switched Machine Translation.
Suman Dowlagar | Radhika Mamidi
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-Switching is the embedding of linguistic units or phrases from two or more languages in a single sentence. This phenomenon is practiced in all multilingual communities and is prominent in social media. Consequently, there is a growing need to understand code-switched text by translating it into one of the standard languages or vice versa. Neural Machine Translation is a well-studied research problem for monolingual text. In this paper, we use gated convolutional sequence-to-sequence networks for English-Hinglish translation. The convolutions in the model help to identify the compositional structure in the sequences more easily. The model relies on gating and performs multiple attention steps at encoder and decoder layers.

Automatic Learning Assistant in Telugu
Meghana Bommadi | Shreya Terupally | Radhika Mamidi
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

This paper presents a learning assistant that tests one’s knowledge and gives feedback that helps a person learn at a faster pace. A learning assistant (based on automated question generation) has extensive uses in education, information websites, self-assessment, FAQs, testing ML agents, research, etc. Multiple researchers and companies have worked on Virtual Assistance, but mostly in English. We built our learning assistant for the Telugu language to help with teaching in the mother tongue, which is the most efficient way of learning. Our system is built primarily on Question Generation in Telugu. Many experiments have been conducted on Question Generation in English in multiple ways. We have built the first hybrid machine learning and rule-based solution in Telugu, which proves efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using Part of Speech (POS) tags and Universal Dependency (UD) tags along with linguistic information of the surrounding relevant context of the word. We used keyword matching and multilingual sentence embeddings to evaluate the answer. In addition to generating questions, the system is capable of evaluating the user’s answers to the generated questions.

2020

Gundapusunil at SemEval-2020 Task 8: Multimodal Memotion Analysis
Sunil Gundapu | Radhika Mamidi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Recent technological advancements in the Internet and social media usage have resulted in the evolution of faster and more efficient platforms of communication. These platforms include visual, textual and speech mediums and have brought a unique social phenomenon called Internet memes. Internet memes are in the form of images with witty, catchy, or sarcastic text descriptions. In this paper, we present a multi-modal sentiment analysis system using deep neural networks combining Computer Vision and Natural Language Processing. Our aim differs from the usual sentiment analysis goal of predicting whether a text expresses positive or negative sentiment; instead, we aim to classify the Internet meme as positive, negative, or neutral, identify the type of humor expressed, and quantify the extent to which a particular effect is being expressed. Our system has been developed using CNN and LSTM and outperformed the baseline score.

Gundapusunil at SemEval-2020 Task 9: Syntactic Semantic LSTM Architecture for SENTIment Analysis of Code-MIXed Data
Sunil Gundapu | Radhika Mamidi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

The phenomenon of mixing the vocabulary and syntax of multiple languages within the same utterance is called Code-Mixing. This is more evident in multilingual societies. In this paper, we have developed a system for SemEval 2020: Task 9 on Sentiment Analysis of Hindi-English code-mixed social media text. Our system first generates two types of embeddings for the social media text. The first is character-level embeddings, which encode character-level information and handle out-of-vocabulary entries; the second is FastText word embeddings, which capture morphology and semantics. These two embeddings were passed to the LSTM network, and the system outperformed the baseline model.

SUKHAN: Corpus of Hindi Shayaris annotated with Sentiment Polarity Information
Salil Aggarwal | Abhigyan Ghosh | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Shayari is a form of poetry mainly popular in the Indian subcontinent, in which the poet expresses his emotions and feelings in a very poetic manner. It is one of the best ways to express our thoughts and opinions. Therefore, it is of prime importance to have an annotated corpus of Hindi shayaris for the task of sentiment analysis. In this paper, we introduce SUKHAN, a dataset consisting of Hindi shayaris along with sentiment polarity labels. To the best of our knowledge, this is the first corpus of Hindi shayaris annotated with sentiment polarity information. This corpus contains a total of 733 Hindi shayaris of various genres. Also, this dataset is of utmost value as all the annotation is done manually by five annotators and this makes it a very rich dataset for training purposes. This annotated corpus is also used to build baseline sentiment classification models using machine learning techniques.

Does a Hybrid Neural Network based Feature Selection Model Improve Text Classification?
Suman Dowlagar | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Text classification is a fundamental problem in the field of natural language processing. Text classification mainly focuses on giving more importance to all the relevant features that help classify the textual data. Apart from these, the text can have redundant or highly correlated features. These features increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods were proposed with the traditional machine learning classifiers. The use of dimensionality reduction methods with machine learning classifiers has achieved good results. In this paper, we propose a hybrid feature selection method for obtaining relevant features by combining various filter-based feature selection methods and fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observed a reduction in training time when feature selection methods are used along with neural networks. We also observed a slight increase in accuracy on some datasets.

Question and Answer pair generation for Telugu short stories
Meghana Bommadi | Shreya Terupally | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

Question Answer pair generation is a task that has been worked upon by multiple researchers in many languages. It has been a topic of interest due to its extensive uses in different fields like self-assessment, academics, business website FAQs, etc. Many experiments have been conducted on Question Answer pair generation in English, concentrating on basic Wh-questions with a rule-based approach. We have built the first hybrid machine learning and rule-based solution in Telugu, which is efficient for short stories or short passages in children’s books. Our work covers the fundamental question forms with the question types: adjective, yes/no, adverb, verb, when, where, whose, quotative, and quantitative (how many/how much). We constructed rules for question generation using POS tags and UD tags along with linguistic information of the surrounding context of the word.

Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

Multichannel LSTM-CNN for Telugu Text Classification
Sunil Gundapu | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

With the instantaneous growth of text information, retrieving domain-oriented information from the text data has a broad range of applications in Information Retrieval and Natural Language Processing. Thematic keywords give a compressed representation of the text. Usually, Domain Identification plays a significant role in Machine Translation, Text Summarization, Question Answering, Information Extraction, and Sentiment Analysis. In this paper, we propose the Multichannel LSTM-CNN methodology for Technical Domain Identification for Telugu. This architecture was used and evaluated in the context of the ICON shared task “TechDOfication 2020” (task h), and our system achieved an F1 score of 69.9% on the test dataset and 90.01% on the validation set.

Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification
Suman Dowlagar | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

In this paper, we present a transfer learning system to perform technical domain identification on multilingual text data. We have submitted two runs: one uses the transformer model BERT, and the other uses XLM-RoBERTa with the CNN model for text classification. These models allowed us to identify the domain of the given sentences for the ICON 2020 shared task, TechDOfication: Technical Domain Identification. Our system ranked best for the subtasks 1d and 1g on the given TechDOfication dataset.

Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

Unsupervised Technical Domain Terms Extraction using Term Extractor
Suman Dowlagar | Radhika Mamidi
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

Terminology extraction, also known as term extraction, is a subtask of information extraction. The goal of terminology extraction is to extract relevant words or phrases from a given corpus automatically. This paper focuses on the unsupervised automated domain term extraction method that considers chunking, preprocessing, and ranking domain-specific terms using relevance and cohesion functions for ICON 2020 shared task 2: TermTraction.

Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task
Dipti Misra Sharma | Asif Ekbal | Karunesh Arora | Sudip Kumar Naskar | Dipankar Ganguly | Sobha L | Radhika Mamidi | Sunita Arora | Pruthwik Mishra | Vandan Mujadia
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

Detecting Sarcasm in Conversation Context Using Transformer-Based Models
Adithya Avvaru | Sanath Vobilisetty | Radhika Mamidi
Proceedings of the Second Workshop on Figurative Language Processing

Sarcasm detection, regarded as one of the sub-problems of sentiment analysis, is a very typical task because the introduction of sarcastic words can flip the sentiment of the sentence itself. To date, many research works revolve around detecting sarcasm in one single sentence, and there is very limited research on detecting sarcasm resulting from multiple sentences. Current models use Long Short Term Memory (LSTM) variants with or without attention to detect sarcasm in conversations. We show that models using state-of-the-art Bidirectional Encoder Representations from Transformers (BERT), to capture syntactic and semantic information across conversation sentences, perform better than the current models. Based on the data analysis, we estimated the number of sentences in the conversation that can contribute to the sarcasm, and the results agree with this estimate. We also perform a comparative study of different versions of our BERT-based model against other variants of the LSTM model and XLNet (both using the estimated number of conversation sentences) and find that BERT-based models outperform them.

A Novel Annotation Schema for Conversational Humor: Capturing the Cultural Nuances in Kanyasulkam
Vaishnavi Pamulapati | Gayatri Purigilla | Radhika Mamidi
Proceedings of the 14th Linguistic Annotation Workshop

Humor research is a multifaceted field that has led to a better understanding of humor’s psychological effects and the development of different theories of humor. This paper’s main objective is to develop a hierarchical schema for a fine-grained annotation of Conversational Humor. Based on the Benign Violation Theory, the benignity or non-benignity of the interlocutor’s intentions is included within the framework. Under the categories mentioned above, in addition to different types of humor, the techniques utilized by these types are identified. Furthermore, a prominent play from Telugu, Kanyasulkam, is annotated to substantiate the work across cultures at multiple levels. The inter-annotator agreement is calculated to assess the accuracy and validity of the dataset. An in-depth analysis of the disagreement is performed to understand the subjectivity of humor better.

Annotated Corpus for Sentiment Analysis in Odia Language
Gaurav Mohanty | Pruthwik Mishra | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Given the lack of an annotated corpus of non-traditional Odia literature which serves as the standard when it comes to sentiment analysis, we have created an annotated corpus of Odia sentences and made it publicly available to promote research in the field. Secondly, in order to test the usability of the currently available Odia sentiment lexicon, we experimented with various classifiers by training and testing on the sentiment annotated corpus while using identified affective words from the same as features. Annotation and classification are done at sentence level as the usage of a sentiment lexicon is best suited to sentiment analysis at this level. The created corpus contains 2045 Odia sentences from the news domain annotated with sentiment labels using a well-defined annotation scheme. An inter-annotator agreement score of 0.79 is reported for the corpus.

Manovaad: A Novel Approach to Event Oriented Corpus Creation Capturing Subjectivity and Focus
Lalitha Kameswari | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

In today’s era of globalisation, the increased outreach for every event across the world has been leading to conflicting opinions, arguments and disagreements, often reflected in print media and online social platforms. It is necessary to distinguish factual observations from personal judgements in news, as subjectivity in reporting can influence the audience’s perception of reality. Several studies conducted on the different styles of reporting in journalism are essential in understanding phenomena such as media bias and multiple interpretations of the same event. This domain finds applications in fields such as Media Studies, Discourse Analysis, Information Extraction, Sentiment Analysis, and Opinion Mining. We present an event corpus Manovaad-v1.0 consisting of 1035 news articles corresponding to 65 events from 3 levels of newspapers viz., Local, National, and International levels. Using this novel format, we correlate the trends in the degree of subjectivity with the geographical closeness of reporting using a Bi-RNN model. We also analyse the role of background and focus in event reporting and capture the focus shift patterns within a global discourse structure for an event. We do this across different levels of reporting and compare the results with the existing work on discourse processing.

Dataset Creation and Evaluation of Aspect Based Sentiment Analysis in Telugu, a Low Resource Language
Yashwanth Reddy Regatte | Rama Rohit Reddy Gangula | Radhika Mamidi
Proceedings of the Twelfth Language Resources and Evaluation Conference

In recent years, sentiment analysis has gained popularity as it is essential to moderate and analyse the information across the internet. It has various applications like opinion mining, social media monitoring, and market research. Aspect Based Sentiment Analysis (ABSA) is an area of sentiment analysis which deals with sentiment at a finer level. ABSA classifies sentiment with respect to each aspect to gain greater insights into the sentiment expressed. Significant contributions have been made in ABSA, but this progress is limited only to a few languages with adequate resources. Telugu lags behind in this area of research despite being one of the most spoken languages in India and an enormous amount of data being created each day. In this paper, we create a reliable resource for aspect based sentiment analysis in Telugu. The data is annotated for three tasks namely Aspect Term Extraction, Aspect Polarity Classification and Aspect Categorisation. Further, we develop baselines for the tasks using deep learning methods demonstrating the reliability and usefulness of the resource.

Leveraging Multilingual Resources for Language Invariant Sentiment Analysis
Allen Antony | Arghya Bhattacharya | Jaipal Goud | Radhika Mamidi
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Sentiment analysis is a widely researched NLP problem with state-of-the-art solutions capable of attaining human-like accuracies for various languages. However, these methods rely heavily on large amounts of labeled data or sentiment weighted language-specific lexical resources that are unavailable for low-resource languages. Our work attempts to tackle this data scarcity issue by introducing a neural architecture for language invariant sentiment analysis capable of leveraging various monolingual datasets for training without any kind of cross-lingual supervision. The proposed architecture attempts to learn language agnostic sentiment features via adversarial training on multiple resource-rich languages which can then be leveraged for inferring sentiment information at a sentence level on a low resource language. Our model outperforms the current state-of-the-art methods on the Multilingual Amazon Review Text Classification dataset [REF] and achieves significant performance gains over prior work on the low resource Sentiraama corpus [REF]. A detailed analysis of our research highlights the ability of our architecture to perform significantly well in the presence of minimal amounts of training data for low resource languages.

Enhancing Bias Detection in Political News Using Pragmatic Presupposition
Lalitha Kameswari | Dama Sravani | Radhika Mamidi
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

Usage of presuppositions in social media and news discourse can be a powerful way to influence the readers as they usually tend to not examine the truth value of the hidden or indirectly expressed information. Fairclough and Wodak (1997) discuss presupposition at a discourse level where some implicit claims are taken for granted in the explicit meaning of a text or utterance. From the Gricean perspective, the presuppositions of a sentence determine the class of contexts in which the sentence could be felicitously uttered. This paper aims to correlate the type of knowledge presupposed in a news article to the bias present in it. We propose a set of guidelines to identify various kinds of presuppositions in news articles and present a dataset consisting of 1050 articles which are annotated for bias (positive, negative or neutral) and the magnitude of presupposition. We introduce a supervised classification approach for detecting bias in political news which significantly outperforms the existing systems.

2019

Samajh-Boojh: A Reading Comprehension system in Hindi
Shalaka Vaidya | Hiranmai Sri Adibhatla | Radhika Mamidi
Proceedings of the 16th International Conference on Natural Language Processing

This paper presents a novel approach designed to answer questions on a reading comprehension passage. It is an end-to-end system which first focuses on comprehending the given passage, wherein it converts the unstructured passage into structured data, and later proceeds to answer the questions related to the passage using solely the aforementioned structured data. To the best of our knowledge, the proposed design is the first of its kind that accounts for the entire process of comprehending the passage and then answering the questions associated with the passage. The comprehension stage converts the passage into a Discourse Collection that comprises the relations shared amongst logical sentences in the given passage along with the key characteristics of each sentence. This design has applications in the academic domain and in query comprehension in speech systems, among others.

SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text
Kartikey Pant | Venkata Himakar Yanamandra | Alok Debnath | Radhika Mamidi
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Contemporary datasets on tobacco consumption focus on one of two topics, either public health mentions and disease surveillance, or sentiment analysis on topical tobacco products and services. However, two primary considerations are not accounted for: the language of the demographic affected, and a combination of the topics mentioned above in a fine-grained classification mechanism. In this paper, we create a dataset of 3144 tweets, which are selected based on the presence of colloquial slang related to smoking, and analyze it based on the semantics of the tweet. Each class is created and annotated based on the content of the tweets such that further hierarchical methods can be easily applied. Further, we demonstrate the efficacy of standard text classification methods on this dataset by designing experiments which do both binary and multi-class classification. Our experiments tackle the identification of either a specific topic (such as tobacco product promotion), a general mention (cigarettes and related products) or a more fine-grained classification. This methodology paves the way for further analysis, such as understanding sentiment or style, which makes this dataset a vital contribution to both disease surveillance and tobacco use research.

Samvaadhana: A Telugu Dialogue System in Hospital Domain
Suma Reddy Duggenpudi | Kusampudi Siva Subrahamanyam Varma | Radhika Mamidi
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

In this paper, a dialogue system for Hospital domain in Telugu, which is a resource-poor Dravidian language, has been built. It handles various hospital and doctor related queries. The main aim of this paper is to present an approach for modelling a dialogue system in a resource-poor language by combining linguistic and domain knowledge. Focusing on the question answering aspect of the dialogue system, we identified Question Classification and Query Processing as the two most important parts of the dialogue system. Our method combines deep learning techniques for question classification and computational rule-based analysis for query processing. Human evaluation of the system has been performed as there is no automated evaluation tool for dialogue systems in Telugu. Our system achieves a high overall rating along with a significantly accurate context-capturing method as shown in the results.

Stance Detection in Code-Mixed Hindi-English Social Media Data using Multi-Task Learning
Sushmitha Reddy Sane | Suraj Tripathi | Koushik Reddy Sane | Radhika Mamidi
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Social media sites like Facebook, Twitter, and other microblogging forums have emerged as a platform for people to express their opinions and views on different issues and events. It is often observed that people tend to take a stance: in favor, against or neutral towards a particular topic. The task of assessing the stance taken by the individual has become significantly important with the growth in the usage of online social platforms. An automatic stance detection system understands the user’s stance by analyzing the standalone texts against a target entity. Due to the limited contextual information a single sentence provides, it is challenging to solve this task effectively. In this paper, we introduce a Multi-Task Learning (MTL) based deep neural network architecture for automatically detecting stance present in the code-mixed corpus. We apply our approach on a Hindi-English code-mixed corpus against the target entity “Demonetisation.” Our best model achieved a stance prediction accuracy of 63.2%, which is a 4.5% overall accuracy improvement compared to the current supervised classification systems developed using the benchmark dataset for code-mixed data stance detection.

Deep Learning Techniques for Humor Detection in Hindi-English Code-Mixed Tweets
Sushmitha Reddy Sane | Suraj Tripathi | Koushik Reddy Sane | Radhika Mamidi
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We propose bilingual word embeddings based on word2vec and fastText models (CBOW and Skip-gram) to address the problem of humor detection in Hindi-English code-mixed tweets in combination with deep learning architectures. We focus on deep learning approaches, which are not widely used on code-mixed data, and analyze their performance by experimenting with three different neural network models. We propose convolutional neural network (CNN) and bidirectional long short-term memory (biLSTM) (with and without Attention) models which take the generated bilingual embeddings as input. We make use of Twitter data to create bilingual word embeddings. All our proposed architectures outperform the state-of-the-art results, and the Attention-based bidirectional LSTM model achieved an accuracy of 73.6%, which is an increment of more than 4% compared to the current state-of-the-art results.

Detecting Political Bias in News Articles Using Headline Attention
Rama Rohit Reddy Gangula | Suma Reddy Duggenpudi | Radhika Mamidi
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Language is a powerful tool which can be used to state facts as well as to express our views and perceptions. Most of the time, we find a subtle bias towards or against someone or something. When it comes to politics, media houses and journalists are known to create bias by shrewd means such as misinterpreting reality and distorting viewpoints towards some parties. This misinterpretation on a large scale can lead to the production of biased news and conspiracy theories. Automating bias detection in newspaper articles could be a good challenge for research in NLP. We propose a headline attention network for this bias detection. Our model has two distinctive characteristics: (i) it has a structure that mirrors a person’s way of reading a news article, and (ii) it applies an attention mechanism to the article based on its headline, enabling it to attend to more critical content to predict bias. As the required datasets were not available, we created a dataset comprising 1329 news articles collected from various Telugu newspapers and marked them for bias towards a particular political party. The experiments conducted on it demonstrate that our model outperforms various baseline methods by a substantial margin.

2018

pdf
Towards Automation of Sense-type Identification of Verbs in OntoSenseNet
Sreekavitha Parupalli | Vijjini Anvesh Rao | Radhika Mamidi
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

In this paper, we discuss the enrichment of a manually developed resource, OntoSenseNet for Telugu. OntoSenseNet is a sense-annotated resource that marks each verb of Telugu with a primary and a secondary sense. The area of research is relatively recent but has a large scope for development. We provide an introductory work to enrich OntoSenseNet to promote further research in Telugu. Classifiers are adopted to learn the sense-relevant features of the words in the resource and also to automate the tagging of sense-types for verbs. We perform a comparative analysis of different classifiers applied to OntoSenseNet. The results of the experiment show that automated enrichment of the resource is effective using SVM classifiers and an Adaboost ensemble.

pdf
Towards Enhancing Lexical Resource and Using Sense-annotations of OntoSenseNet for Sentiment Analysis
Sreekavitha Parupalli | Vijjini Anvesh Rao | Radhika Mamidi
Proceedings of the Third Workshop on Semantic Deep Learning

This paper illustrates the interface of the tool we developed for crowdsourcing and explains the annotation procedure in detail. Our tool is named ‘పారుపల్లి పదజాలం’ (Parupalli Padajaalam), which means web of words by Parupalli. The aim of this tool is to populate OntoSenseNet, a sentiment-polarity-annotated Telugu resource. Recent works have shown the importance of word-level annotations for sentiment analysis. With this as a basis, we aim to analyze the importance of sense-annotations obtained from OntoSenseNet in performing the task of sentiment analysis. We explain the features extracted from OntoSenseNet (Telugu). Furthermore, we compute and explain the adverbial class distribution of verbs in OntoSenseNet. This task is known to aid in disambiguating word-senses, which helps in enhancing the performance of word-sense disambiguation (WSD) task(s).

pdf
BCSAT : A Benchmark Corpus for Sentiment Analysis in Telugu Using Word-level Annotations
Sreekavitha Parupalli | Vijjini Anvesh Rao | Radhika Mamidi
Proceedings of ACL 2018, Student Research Workshop

The presented work aims at generating a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu using word-level sentiment annotations. From OntoSenseNet, we extracted 11,000 adjectives, 253 adverbs, and 8,483 verbs; sentiment annotation is being done by language experts. We discuss the methodology followed for the polarity annotations and validate the developed resource. This work aims at developing a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy for a model where lexeme annotations are applied for sentiment predictions. The fundamental aim of this paper is to validate and study the possibility of utilizing machine learning algorithms and word-level sentiment annotations in the task of automated sentiment identification. Furthermore, accuracy is improved by annotating the bi-grams extracted from the target corpus.

pdf
Exploring Chunk Based Templates for Generating a subset of English Text
Nikhilesh Bhatnagar | Manish Shrivastava | Radhika Mamidi
Proceedings of ACL 2018, Student Research Workshop

Natural Language Generation (NLG) is a research task which addresses the automatic generation of natural language text representative of an input non-linguistic collection of knowledge. In this paper, we address the task of the generation of grammatical sentences in an isolated context given a partial bag-of-words which the generated sentence must contain. We view the task as a search problem (a problem of choice) involving combinations of smaller chunk based templates extracted from a training corpus to construct a complete sentence. To achieve that, we propose a fitness function which we use in conjunction with an evolutionary algorithm as the search procedure to arrive at a potentially grammatical sentence (modeled by the fitness score) which satisfies the input constraints.

pdf
Automatic Spelling Correction for Resource-Scarce Languages using Deep Learning
Pravallika Etoori | Manoj Chinnakotla | Radhika Mamidi
Proceedings of ACL 2018, Student Research Workshop

Spelling correction is a well-known task in Natural Language Processing (NLP). Automatic spelling correction is important for many NLP applications such as web search engines, text summarization, and sentiment analysis. Most approaches use parallel data of noisy and correct word mappings from different sources as training data for automatic spelling correction. Indic languages are resource-scarce and do not have such parallel data due to the low volume of queries and the non-existence of such prior implementations. In this paper, we show how to build an automatic spelling corrector for resource-scarce languages. We propose a sequence-to-sequence deep learning model which trains end-to-end. We perform experiments on synthetic datasets created for the Indic languages Hindi and Telugu by incorporating spelling mistakes committed at the character level. A comparative evaluation shows that our model is competitive with the existing spell checking and correction techniques for Indic languages.

pdf
Resource Creation Towards Automated Sentiment Analysis in Telugu (a low resource language) and Integrating Multiple Domain Sources to Enhance Sentiment Prediction
Rama Rohit Reddy Gangula | Radhika Mamidi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Predicting the Genre and Rating of a Movie Based on its Synopsis
Varshit Battu | Vishal Batchu | Rama Rohit Reddy Gangula | Mohana Murali Krishna Reddy Dakannagari | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Word Level Language Identification in English Telugu Code Mixed Data
Sunil Gundapu | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Political Discourse Analysis : A Case Study of 2014 Andhra Pradesh State Assembly Election of Interpersonal Speech Choices
Lalitha Kameswari | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Affect in Tweets using Experts Model
Subba Reddy Oota | Adithya Avvaru | Mounika Reddy Marreddy | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Syllables for Sentence Classification in Morphologically Rich Languages
Madhuri Tummalapalli | Radhika Mamidi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2017

pdf
Automatic Generation of Jokes in Hindi
Srishti Aggarwal | Radhika Mamidi
Proceedings of ACL 2017, Student Research Workshop

pdf bib
When does a compliment become sexist? Analysis and classification of ambivalent sexism using twitter data
Akshita Jha | Radhika Mamidi
Proceedings of the Second Workshop on NLP and Computational Social Science

Sexism is prevalent in today’s society, both offline and online, and poses a credible threat to social equality with respect to gender. According to ambivalent sexism theory (Glick and Fiske, 1996), it comes in two forms: Hostile and Benevolent. While hostile sexism is characterized by an explicitly negative attitude, benevolent sexism is more subtle. Previous works on computationally detecting sexism present online are restricted to identifying the hostile form. Our objective is to investigate the less pronounced form of sexism demonstrated online. We achieve this by creating and analyzing a dataset of tweets that exhibit benevolent sexism. Using Support Vector Machines (SVM), sequence-to-sequence models, and a FastText classifier, we classify tweets into the ‘Hostile’, ‘Benevolent’ or ‘Others’ class depending on the kind of sexism they exhibit. We achieve an F1-score of 87.22% using the FastText classifier. Our work helps analyze and understand the highly prevalent ambivalent sexism in social media.

pdf
Building a SentiWordNet for Odia
Gaurav Mohanty | Abishek Kannan | Radhika Mamidi
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

As a discipline of Natural Language Processing, Sentiment Analysis is used to extract and analyze subjective information present in natural language data. The task of Sentiment Analysis has acquired wide commercial uses including social media monitoring tasks, survey responses, review systems, etc. Languages like English have several resources which aid in the task of Sentiment Analysis. SentiWordNet and Subjectivity WordList are examples of such tools and resources. With more data being available in native vernacular, language-specific SentiWordNet(s) have become essential. For resource-poor languages, creating such SentiWordNet(s) is a difficult task. One solution is to use available resources in English and translate the final source lexicon to the target lexicon via machine translation. Machine translation systems for the English-Odia language pair have not yet been developed. In this paper, we discuss a method to create a SentiWordNet for Odia, which is resource-poor, using only resources which are currently available for Indian languages. The lexicon created would serve as a tool for Sentiment Analysis tasks specific to Odia data.

pdf
ACTSA: Annotated Corpus for Telugu Sentiment Analysis
Sandeep Sricharan Mukku | Radhika Mamidi
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

Sentiment analysis deals with the task of determining the polarity of a document or sentence and has received a lot of attention in recent years for the English language. With the rapid growth of social media these days, a lot of data is available in regional languages besides English. Telugu is one such regional language with abundant data available in social media, but it is hard to find labelled sentence data for Telugu Sentiment Analysis. In this paper, we describe an effort to build a gold-standard annotated corpus of Telugu sentences to support Telugu Sentiment Analysis. The corpus, named ACTSA (Annotated Corpus for Telugu Sentiment Analysis), is a collection of Telugu sentences taken from different sources, which were then pre-processed and manually annotated by native Telugu speakers using our annotation guidelines. In total, we have annotated 5457 sentences, which makes our corpus the largest resource currently available. The corpus and the annotation guidelines are made publicly available.

pdf
Handling Multi-Sentence Queries in a Domain Independent Dialogue System
Prathyusha Jwalapuram | Radhika Mamidi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

pdf
IIIT at SemEval-2016 Task 11: Complex Word Identification using Nearest Centroid Classification
Ashish Palakurthi | Radhika Mamidi
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text
Arnav Sharma | Sakshi Gupta | Raveesh Motlani | Piyush Bansal | Manish Shrivastava | Radhika Mamidi | Dipti M. Sharma
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Towards Building a SentiWordNet for Tamil
Abishek Kannan | Gaurav Mohanty | Radhika Mamidi
Proceedings of the 13th International Conference on Natural Language Processing

2015

pdf
Resolution of Pronominal Anaphora for Telugu Dialogues
Hemanth Reddy Jonnalagadda | Radhika Mamidi
Proceedings of the 12th International Conference on Natural Language Processing

pdf
A Semi Supervised Dialog Act Tagging for Telugu
Suman Dowlagar | Radhika Mamidi
Proceedings of the 12th International Conference on Natural Language Processing

pdf
Statistical Sandhi Splitter and its Effect on NLP Applications
Prathyusha Kuncham | Kovida Nelakuditi | Radhika Mamidi
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf
Classification of Attributes in a Natural Language Query into Different SQL Clauses
Ashish Palakurthi | Ruthu S M | Arjun Akula | Radhika Mamidi
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf
Learning phrase-level vocabulary in second language using pictures/gestures and voice
Lavanya Prahallad | Prathyusha Danda | Radhika Mamidi
Proceedings of the 11th International Conference on Natural Language Processing

pdf
Identification of Karaka relations in an English sentence
Sai Kiran Gorthi | Ashish Palakurthi | Radhika Mamidi | Dipti Misra Sharma
Proceedings of the 11th International Conference on Natural Language Processing

pdf
Statistical Morph Analyzer (SMA++) for Indian Languages
Saikrishna Srirampur | Ravi Chandibhamar | Radhika Mamidi
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

2013

pdf
Stance Classification in Online Debates by Recognizing Users’ Intentions
Sarvesh Ranade | Rajeev Sangal | Radhika Mamidi
Proceedings of the SIGDIAL 2013 Conference

pdf
A Novel Approach Towards Incorporating Context Processing Capabilities in NLIDB System
Arjun Akula | Rajeev Sangal | Radhika Mamidi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages
Dipti Misra Sharma | Prashanth Mannem | Joseph vanGenabith | Sobha Lalitha Devi | Radhika Mamidi | Ranjani Parthasarathi
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

pdf bib
Proceedings of the Workshop on Speech and Language Processing Tools in Education
Radhika Mamidi | Kishore Prahallad
Proceedings of the Workshop on Speech and Language Processing Tools in Education

pdf
A template matching approach for detecting pronunciation mismatch
Lavanya Prahallad | Radhika Mamidi | Kishore Prahallad
Proceedings of the Workshop on Speech and Language Processing Tools in Education
