Mayank Singh


MUTANT: A Multi-sentential Code-mixed Hinglish Dataset
Rahul Gupta | Vivek Srivastava | Mayank Singh
Findings of the Association for Computational Linguistics: EACL 2023

The multi-sentential long sequence textual data unfolds several interesting research directions pertaining to natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there is no significant effort in building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset i.e., MUTANT. We propose a token-level language-aware pipeline and extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research directions, we will make the dataset and the code publicly available upon publication.


HinglishEval Generation Challenge on Quality Estimation of Synthetic Code-Mixed Text: Overview and Results
Vivek Srivastava | Mayank Singh
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We hosted a shared task to investigate the factors influencing the quality of the code-mixed text generation systems. The teams experimented with two systems that generate synthetic code-mixed Hinglish sentences. They also experimented with human ratings that evaluate the generation quality of the two systems. The first-of-its-kind, proposed subtasks, (i) quality rating prediction and (ii) annotators’ disagreement prediction of the synthetic Hinglish dataset made the shared task quite popular among the multilingual research community. A total of 46 participants comprising 23 teams from 18 institutions registered for this shared task. The detailed description of the task and the leaderboard is available at https://codalab.lisn.

The Bull and the Bear: Summarizing Stock Market Discussions
Ayush Kumar | Dhyey Jani | Jay Shah | Devanshu Thakar | Varun Jain | Mayank Singh
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Stock market investors debate and heavily discuss stock ideas, investing strategies, news and market movements on social media platforms. The discussions are significantly longer in length and require extensive domain expertise for understanding. In this paper, we curate such discussions and construct a first-of-its-kind abstractive summarization dataset. Our curated dataset consists of 7888 Reddit posts and manually constructed summaries for 400 posts. We robustly evaluate the summaries and conduct experiments on SOTA summarization tools to showcase their limitations. We plan to make the dataset publicly available. The sample dataset is available here:

The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through
Shruti Singh | Mayank Singh
Findings of the Association for Computational Linguistics: ACL 2022

Language models are becoming increasingly popular in AI-powered scientific IR systems. This paper evaluates popular scientific language models in handling (i) short-query texts and (ii) textual neighbors. Our experiments showcase the inability to retrieve relevant documents for a short-query text even under the most relaxed conditions. Additionally, we leverage textual neighbors, generated by small perturbations to the original text, to demonstrate that not all perturbations lead to close neighbors in the embedding space. Further, an exhaustive categorization yields several classes of orthographically and semantically related, partially related and completely unrelated neighbors. Retrieval performance turns out to be influenced more by the surface form than by the semantics of the text.

Overview and Results of MixMT Shared-Task at WMT 2022
Vivek Srivastava | Mayank Singh
Proceedings of the Seventh Conference on Machine Translation (WMT)

In this paper, we present an overview of the WMT 2022 shared task on code-mixed machine translation (MixMT). In this shared task, we hosted two code-mixed machine translation subtasks in the following settings: (i) monolingual to code-mixed translation and (ii) code-mixed to monolingual translation. In both the subtasks, we received registration and participation from teams across the globe showing an interest and need to immediately address the challenges with machine translation involving code-mixed and low-resource languages.


TweeNLP: A Twitter Exploration Portal for Natural Language Processing
Viraj Shah | Shruti Singh | Mayank Singh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

We present TweeNLP, a one-stop portal that organizes Twitter’s natural language processing (NLP) data and builds a visualization and exploration platform. It curates 19,395 tweets (as of April 2021) from various NLP conferences and general NLP discussions. It supports multiple features such as TweetExplorer to explore tweets by topics, visualize insights from Twitter activity throughout the organization cycle of conferences, discover popular research papers and researchers. It also builds a timeline of conference and workshop submission deadlines. We envision TweeNLP to function as a collective memory unit for the NLP community by integrating the tweets pertaining to research papers with the NLPExplorer scientific literature search engine. The current system is hosted at

MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
Ayush Garg | Sammed Kagi | Vivek Srivastava | Mayank Singh
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text. Due to the high linguistic diversity, code-mixing presents several challenges in evaluating standard natural language generation (NLG) tasks. Various widely popular metrics perform poorly with the code-mixed NLG tasks. To address this challenge, we present a metric independent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments on the generated code-mixed text. As a use case, we demonstrate the performance of MIPE on the machine-generated Hinglish (code-mixing of Hindi and English languages) sentences from the HinGE corpus. We can extend the proposed evaluation strategy to other code-mixed language pairs, NLG tasks, and evaluation metrics with minimal to no effort.

HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
Vivek Srivastava | Mayank Singh
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Text generation is a highly active area of research in the computational linguistic community. The evaluation of the generated text is a challenging task and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation are relatively understudied due to the scarcity of high-quality resources in code-mixed languages where the words and phrases from multiple languages are mixed in a single utterance of text and speech. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages). HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences. In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.

Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text
Vivek Srivastava | Mayank Singh
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-mixing is a frequent communication style among multilingual speakers where they mix words and phrases from two different languages in the same utterance of text or speech. Identifying and filtering code-mixed text is a challenging task due to its co-existence with monolingual and noisy text. Over the years, several code-mixing metrics have been extensively used to identify and validate code-mixed text quality. This paper demonstrates several inherent limitations of code-mixing metrics with examples from the already existing datasets that are popularly used across various experiments.

PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger
Vivek Srivastava | Mayank Singh
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

WhatsApp Messenger is one of the most popular channels for spreading information with a current reach of more than 180 countries and 2 billion people. Its widespread usage has made it one of the most popular media for information propagation among the masses during any socially engaging event. In the recent past, several countries have witnessed its effectiveness and influence in political and social campaigns. We observe a high surge in information and propaganda flow during election campaigning. In this paper, we explore a high-quality large-scale user-generated dataset curated from WhatsApp comprising 281 groups, 31,078 unique users, and 223,404 messages shared before, during, and after the Indian General Elections 2019, encompassing all major Indian political parties and leaders. In addition to the raw noisy user-generated data, we present a fine-grained annotated dataset of 3,848 messages that will be useful to understand the various dimensions of WhatsApp political campaigning. We present several complementary insights into the investigative and sensational news stories from the same period. Exploratory data analysis and experiments showcase several exciting results and future research opportunities. To facilitate reproducible research, we make the anonymized datasets available in the public domain.

Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text
Vivek Srivastava | Mayank Singh
Proceedings of the 14th International Conference on Natural Language Generation

In this shared task, we seek the participating teams to investigate the factors influencing the quality of the code-mixed text generation systems. We synthetically generate code-mixed Hinglish sentences using two distinct approaches and employ human annotators to rate the generation quality. We propose two subtasks, quality rating prediction and annotators’ disagreement prediction of the synthetic Hinglish dataset. The proposed subtasks will put forward the reasoning and explanation of the factors influencing the quality and human perception of the code-mixed text.


IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection
Vivek Srivastava | Mayank Singh
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Code-mixing is the phenomenon of using multiple languages in the same utterance. It is a frequently used pattern of communication on social media sites such as Facebook, Twitter, etc. Sentiment analysis of the monolingual text is a well-studied task. Code-mixing adds to the challenge of analyzing the sentiment of the text on various platforms such as social media, online gaming, forums, product reviews, etc. We present a candidate sentence generation and selection based approach on top of the Bi-LSTM based neural classifier to classify the Hinglish code-mixed text into one of the three sentiment classes: positive, negative, or neutral. The proposed candidate sentence generation and selection based approach shows an improvement in the system performance as compared to the Bi-LSTM based neural classifier. We can extend the proposed method to solve other problems with code-mixing in the textual data, such as humor-detection, intent classification, etc.

PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation
Vivek Srivastava | Mayank Singh
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Code-mixing is the phenomenon of using more than one language in a sentence. In the multilingual communities, it is a very frequently observed pattern of communication on social media platforms. Flexibility to use multiple languages in one text message might help to communicate efficiently with the target audience. But the noisy user-generated code-mixed text adds to the challenge of processing and understanding natural language to a much larger extent. Machine translation from monolingual source to the target language is a well-studied research problem. Here, we demonstrate that widely popular and sophisticated translation systems such as Google Translate fail at times to translate code-mixed text effectively. To address this challenge, we present a parallel corpus of 13,738 code-mixed Hindi-English sentences and their corresponding human translation in English. In addition, we also propose a translation pipeline built on top of Google Translate. The evaluation of the proposed pipeline on PHINC demonstrates an increase in the performance of the underlying system. With minimal effort, we can extend the dataset and the proposed approach to other code-mixing language pairs.


IIT Gandhinagar at SemEval-2019 Task 3: Contextual Emotion Detection Using Deep Learning
Arik Pamnani | Rajat Goel | Jayesh Choudhari | Mayank Singh
Proceedings of the 13th International Workshop on Semantic Evaluation

Recent advancements in Internet and Mobile infrastructure have resulted in the development of faster and more efficient platforms of communication. These platforms include speech, facial and text-based conversational mediums. The majority of these are text-based messaging platforms. Development of Chatbots that automatically understand latent emotions in the textual message is a challenging task. In this paper, we present an automatic emotion detection system that aims to detect the emotion of a person textually conversing with a chatbot. We explore deep learning techniques such as CNN and LSTM based neural networks and outperform the baseline score by 14%. The trained model and code are kept in the public domain.


CL Scholar: The ACL Anthology Knowledge Graph Miner
Mayank Singh | Pradeep Dogga | Sohan Patro | Dhiraj Barnwal | Ritam Dutt | Rajarshi Haldar | Pawan Goyal | Animesh Mukherjee
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present CL Scholar, the ACL Anthology knowledge graph miner to facilitate high-quality search and exploration of current research progress in the computational linguistics community. In contrast to previous works, periodically crawling, indexing and processing of new incoming articles is completely automated in the current system. CL Scholar utilizes both textual and network information for knowledge graph construction. As an additional novel initiative, CL Scholar supports more than 1200 scholarly natural language queries along with standard keyword-based search on the constructed knowledge graph. It answers binary, statistical and list-based natural language queries. The current system is deployed at We also provide REST API support along with bulk download facility. Our code and data are available at


OCR++: A Robust Framework For Information Extraction from Scholarly Articles
Mayank Singh | Barnopriyo Barua | Priyank Palod | Manvi Garg | Sidhartha Satapathy | Samuel Bushi | Kumar Ayush | Krishna Sai Rohith | Tulasi Gamidi | Pawan Goyal | Animesh Mukherjee
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper proposes OCR++, an open-source framework designed for a variety of information extraction tasks from scholarly articles including metadata (title, author names, affiliation and e-mail), structure (section headings and body text, table and figure headings, URLs and footnotes) and bibliography (citation instances and references). We analyze a diverse set of scientific articles written in English to understand generic writing patterns and formulate rules to develop this hybrid framework. Extensive evaluations show that the proposed framework outperforms the existing state-of-the-art tools by a large margin in structural information extraction along with improved performance in metadata and bibliography extraction tasks, both in terms of accuracy (around 50% improvement) and processing time (around 52% improvement). A user experience study conducted with the help of 30 researchers reveals that the researchers found this system to be very helpful. As an additional objective, we discuss two novel use cases including automatically extracting links to public datasets from the proceedings, which would further accelerate the advancement in digital libraries. The result of the framework can be exported as a whole into structured TEI-encoded documents. Our framework is accessible online at