Ponnurangam Kumaraguru


2022

pdf
CamPros at CASE 2022 Task 1: Transformer-based Multilingual Protest News Detection
Neha Kumari | Mrinal Anand | Tushar Mohan | Ponnurangam Kumaraguru | Arun Balaji Buduru
Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

Socio-political protests often lead to grave consequences when they occur. The early detection of such protests is very important for taking early precautionary measures. However, the main shortcoming of protest event detection is the scarcity of sufficient training data for specific language categories, which makes it difficult to train data-hungry deep learning models effectively. Therefore, cross-lingual and zero-shot learning models are needed to detect events in various low-resource languages. This paper proposes a multi-lingual cross-document level event detection approach using pre-trained transformer models developed for Shared Task 1 at CASE 2022. The shared task constituted four subtasks for event detection at different granularity levels, i.e., document level to token level, spread over multiple languages (English, Spanish, Portuguese, Turkish, Urdu, and Mandarin). Our system achieves an average F1 score of 0.73 for document-level event detection tasks. Our approach secured 2nd position for the Hindi language in subtask 1 with an F1 score of 0.80. While for Spanish, we secure 4th position with an F1 score of 0.69. Our code is available at https://github.com/nehapspathak/campros/.

pdf
HashSet - A Dataset For Hashtag Segmentation
Prashant Kodali | Akshala Bhatnagar | Naman Ahuja | Manish Shrivastava | Ponnurangam Kumaraguru
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Hashtag segmentation is the task of breaking a hashtag into its constituent tokens. Hashtags often encode the essence of user-generated posts, along with information like topic and sentiment, which are useful in downstream tasks. Hashtags prioritize brevity and are written in unique ways - transliterating and mixing languages, spelling variations, creative named entities. Benchmark datasets used for the hashtag segmentation task - STAN, BOUN - are small and extracted from a single set of tweets. However, datasets should reflect the variations in writing styles of hashtags and account for domain and language specificity, failing which the results will misrepresent model performance. We argue that model performance should be assessed on a wider variety of hashtags, and datasets should be carefully curated. To this end, we propose HashSet, a dataset comprising of: a) 1.9k manually annotated dataset; b) 3.3M loosely supervised dataset. HashSet dataset is sampled from a different set of tweets when compared to existing datasets and provides an alternate distribution of hashtags to build and validate hashtag segmentation models. We analyze the performance of SOTA models for Hashtag Segmentation, and show that the proposed dataset provides an alternate set of hashtags to train and assess models.

pdf
Learning to Automate Follow-up Question Generation using Process Knowledge for Depression Triage on Reddit Posts
Shrey Gupta | Anmol Agarwal | Manas Gaur | Kaushik Roy | Vignesh Narayanan | Ponnurangam Kumaraguru | Amit Sheth
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

Conversational Agents (CAs) powered with deep language models (DLMs) have shown tremendous promise in the domain of mental health. Prominently, the CAs have been used to provide informational or therapeutic services (e.g., cognitive behavioral therapy) to patients. However, the utility of CAs to assist in mental health triaging has not been explored in the existing work as it requires a controlled generation of follow-up questions (FQs), which are often initiated and guided by the mental health professionals (MHPs) in clinical settings. In the context of ‘depression’, our experiments show that DLMs coupled with process knowledge in a mental health questionnaire generate 12.54% and 9.37% better FQs based on similarity and longest common subsequence matches to questions in the PHQ-9 dataset respectively, when compared with DLMs without process knowledge support.Despite coupling with process knowledge, we find that DLMs are still prone to hallucination, i.e., generating redundant, irrelevant, and unsafe FQs. We demonstrate the challenge of using existing datasets to train a DLM for generating FQs that adhere to clinical process knowledge. To address this limitation, we prepared an extended PHQ-9 based dataset, PRIMATE, in collaboration with MHPs. PRIMATE contains annotations regarding whether a particular question in the PHQ-9 dataset has already been answered in the user’s initial description of the mental health condition. We used PRIMATE to train a DLM in a supervised setting to identify which of the PHQ-9 questions can be answered directly from the user’s post and which ones would require more information from the user. Using performance analysis based on MCC scores, we show that PRIMATE is appropriate for identifying questions in PHQ-9 that could guide generative DLMs towards controlled FQ generation (with minimal hallucination) suitable for aiding triaging. The dataset created as a part of this research can be obtained from https://github.com/primate-mh/Primate2022

pdf
SyMCoM - Syntactic Measure of Code Mixing A Study Of English-Hindi Code-Mixing
Prashant Kodali | Anmol Goel | Monojit Choudhury | Manish Shrivastava | Ponnurangam Kumaraguru
Findings of the Association for Computational Linguistics: ACL 2022

Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code mixed texts to train NLP models. For capturing the variety of code mixing in, and across corpus, Language ID (LID) tags based measures (CMI) have been proposed. Syntactical variety/patterns of code-mixing and their relationship vis-a-vis computational model’s performance is under explored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets from a syntactic lens to propose, SyMCoM, an indicator of syntactic variety in code-mixed text, with intuitive theoretical bounds. We train SoTA en-hi PoS tagger, accuracy of 93.4%, to reliably compute PoS tags on a corpus, and demonstrate the utility of SyMCoM by applying it on various syntactical categories on a collection of datasets, and compare datasets using the measure.

pdf
HLDC: Hindi Legal Documents Corpus
Arnav Kapoor | Mudit Dhawan | Anmol Goel | Arjun T H | Akshala Bhatnagar | Vibhu Agrawal | Amul Agrawal | Arnab Bhattacharya | Ponnurangam Kumaraguru | Ashutosh Modi
Findings of the Association for Computational Linguistics: ACL 2022

Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of high-quality corpora that is needed to develop such data-driven systems. The problem gets even more pronounced in the case of low resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area.

pdf
An Unsupervised, Geometric and Syntax-aware Quantification of Polysemy
Anmol Goel | Charu Sharma | Ponnurangam Kumaraguru
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Polysemy is the phenomenon where a single word form possesses two or more related senses. It is an extremely ubiquitous part of natural language and analyzing it has sparked rich discussions in the linguistics, psychology and philosophy communities alike. With scarce attention paid to polysemy in computational linguistics, and even scarcer attention toward quantifying polysemy, in this paper, we propose a novel, unsupervised framework to compute and estimate polysemy scores for words in multiple languages. We infuse our proposed quantification with syntactic knowledge in the form of dependency structures. This informs the final polysemy scores of the lexicon motivated by recent linguistic findings that suggest there is an implicit relation between syntax and ambiguity/polysemy. We adopt a graph based approach by computing the discrete Ollivier Ricci curvature on a graph of the contextual nearest neighbors. We test our framework on curated datasets controlling for different sense distributions of words in 3 typologically diverse languages - English, French and Spanish. The effectiveness of our framework is demonstrated by significant correlations of our quantification with expert human annotated language resources like WordNet. We observe a 0.3 point increase in the correlation coefficient as compared to previous quantification studies in English. Our research leverages contextual language models and syntactic structures to empirically support the widely held theoretical linguistic notion that syntax is intricately linked to ambiguity/polysemy.

pdf
PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality
Prashant Kodali | Tanmay Sachan | Akshay Goindani | Anmol Goel | Naman Ahuja | Manish Shrivastava | Ponnurangam Kumaraguru
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine gen- erated code-mixed text is an open problem. In our submission to HinglishEval, a shared- task collocated with INLG2022, we attempt to build models factors that impact the quality of synthetically generated code-mix text by pre- dicting ratings for code-mix quality. Hingli- shEval Shared Task consists of two sub-tasks - a) Quality rating prediction); b) Disagree- ment prediction. We leverage popular code- mixed metrics and embeddings of multilin- gual large language models (MLLMs) as fea- tures, and train task specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on F-1 and Co- hen’s Kappa Score measures and first for Mean Squared Error measure. For Subtask-B our ap- proach ranked third for F1 score, and first for Mean Squared Error measure. Code of our submission can be accessed here.

2021

pdf
An Exploratory Study on Temporally Evolving Discussion around Covid-19 using Diachronic Word Embeddings
Avinash Tulasi | Asanobu Kitamoto | Ponnurangam Kumaraguru | Arun Balaji Buduru
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

Covid 19 has seen the world go into a lock down and unconventional social situations throughout. During this time, the world saw a surge in information sharing around the pandemic and the topics shared in the time were diverse. People’s sentiments have changed during this period. Given the wide spread usage of Online Social Networks (OSN) and support groups, the user sentiment is well reflected in online discussions. In this work, we aim to show the topics under discussion, evolution of discussions, change in user sentiment during the pandemic. Alongside which, we also demonstrate the possibility of exploratory analysis to find pressing topics, change in perception towards the topics and ways to use the knowledge extracted from online discussions. For our work we employ Diachronic Word embeddings which capture the change in word usage over time. With the help of analysis from temporal word usages, we show the change in people’s option on covid-19 from being a conspiracy, to the post-covid topics that surround vaccination.

pdf
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences
Devansh Gautam | Prashant Kodali | Kshitij Gupta | Anmol Goel | Manish Shrivastava | Ponnurangam Kumaraguru
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-mixed languages are very popular in multilingual societies around the world, yet the resources lag behind to enable robust systems on such languages. A major contributing factor is the informal nature of these languages which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CACLS 2021 to generate a machine translation system for English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several tasks by translating valuable datasets from high resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize the pre-training of the model by transliterating the roman Hindi words in the code-mixed sentences to Devanagri script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART’s performance. Our system gives a BLEU score of 12.22 on test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.

2020

pdf
AbuseAnalyzer: Abuse Detection, Severity and Target Prediction for Gab Posts
Mohit Chandra | Ashwin Pathak | Eesha Dutta | Paryul Jain | Manish Gupta | Manish Shrivastava | Ponnurangam Kumaraguru
Proceedings of the 28th International Conference on Computational Linguistics

While extensive popularity of online social media platforms has made information dissemination faster, it has also resulted in widespread online abuse of different types like hate speech, offensive language, sexist and racist opinions, etc. Detection and curtailment of such abusive content is critical for avoiding its psychological impact on victim communities, and thereby preventing hate crimes. Previous works have focused on classifying user posts into various forms of abusive behavior. But there has hardly been any focus on estimating the severity of abuse and the target. In this paper, we present a first of the kind dataset with 7,601 posts from Gab which looks at online abuse from the perspective of presence of abuse, severity and target of abusive behavior. We also propose a system to address these tasks, obtaining an accuracy of ∼80% for abuse presence, ∼82% for abuse target prediction, and ∼65% for abuse severity prediction.

2018

pdf
A Twitter Corpus for Hindi-English Code Mixed POS Tagging
Kushagra Singh | Indira Sen | Ponnurangam Kumaraguru
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

Code-mixing is a linguistic phenomenon where multiple languages are used in the same occurrence that is increasingly common in multilingual societies. Code-mixed content on social media is also on the rise, prompting the need for tools to automatically understand such content. Automatic Parts-of-Speech (POS) tagging is an essential step in any Natural Language Processing (NLP) pipeline, but there is a lack of annotated data to train such models. In this work, we present a unique language tagged and POS-tagged dataset of code-mixed English-Hindi tweets related to five incidents in India that led to a lot of Twitter activity. Our dataset is unique in two dimensions: (i) it is larger than previous annotated datasets and (ii) it closely resembles typical real-world tweets. Additionally, we present a POS tagging model that is trained on this dataset to provide an example of how this dataset can be used. The model also shows the efficacy of our dataset in enabling the creation of code-mixed social media POS taggers.

pdf
Neural Machine Translation for English-Tamil
Himanshu Choudhary | Aditya Kumar Pathak | Rajiv Ratan Saha | Ponnurangam Kumaraguru
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

A huge amount of valuable resources is available on the web in English, which are often translated into local languages to facilitate knowledge sharing among local people who are not much familiar with English. However, translating such content manually is very tedious, costly, and time-consuming process. To this end, machine translation is an efficient approach to translate text without any human involvement. Neural machine translation (NMT) is one of the most recent and effective translation technique amongst all existing machine translation systems. In this paper, we apply NMT for English-Tamil language pair. We propose a novel neural machine translation technique using word-embedding along with Byte-Pair-Encoding (BPE) to develop an efficient translation system that overcomes the OOV (Out Of Vocabulary) problem for languages which do not have much translations available online. We use the BLEU score for evaluating the system performance. Experimental results confirm that our proposed MIDAS translator (8.33 BLEU score) outperforms Google translator (3.75 BLEU score).

pdf
Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets
Kushagra Singh | Indira Sen | Ponnurangam Kumaraguru
Proceedings of ACL 2018, Student Research Workshop

While growing code-mixed content on Online Social Networks(OSN) provides a fertile ground for studying various aspects of code-mixing, the lack of automated text analysis tools render such studies challenging. To meet this challenge, a family of tools for analyzing code-mixed data such as language identifiers, parts-of-speech (POS) taggers, chunkers have been developed. Named Entity Recognition (NER) is an important text analysis task which is not only informative by itself, but is also needed for downstream NLP tasks such as semantic role labeling. In this work, we present an exploration of automatic NER of code-mixed data. We compare our method with existing off-the-shelf NER tools for social media content,and find that our systems outperforms the best baseline by 33.18 % (F1 score).