Chitta Baral.


2022

pdf
ILDAE: Instance-Level Difficulty Analysis of Evaluation Data
Neeraj Varshney | Swaroop Mishra | Chitta Baral
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge of the difficulty level of questions helps a teacher in several ways, such as estimating students’ potential quickly by asking carefully selected questions and improving the quality of examinations by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in Natural Language Processing? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances, saving computational cost and time, 2) improving the quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics to guide future data creation, and 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications lead to several interesting results: evaluation using just 5% of the instances (selected via ILDAE) achieves as high as 0.93 Kendall correlation with evaluation using the complete dataset, and computing weighted accuracy using difficulty scores leads to 5.2% higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our work will encourage research in this important yet understudied field of leveraging instance difficulty in evaluations.
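
The abstract above mentions computing weighted accuracy using difficulty scores but does not give the formula. As a minimal sketch, assuming each instance is weighted in direct proportion to its difficulty score (the function name and the weighting scheme are illustrative assumptions, not the authors' definition):

```python
from typing import Sequence


def weighted_accuracy(correct: Sequence[bool], difficulty: Sequence[float]) -> float:
    """Accuracy in which each instance counts in proportion to its difficulty.

    Assumption: weights are the raw difficulty scores; the paper may
    normalize or transform them differently.
    """
    if len(correct) != len(difficulty):
        raise ValueError("correct and difficulty must have the same length")
    total = sum(difficulty)
    if total <= 0:
        raise ValueError("difficulty scores must sum to a positive value")
    # Sum the weights of the instances the model answered correctly.
    earned = sum(d for c, d in zip(correct, difficulty) if c)
    return earned / total


# Toy data (made up): three of four answers are correct, but the
# hardest instance is the one that was missed.
is_correct = [True, True, False, True]
scores = [0.1, 0.2, 0.9, 0.8]
print(weighted_accuracy(is_correct, scores))
```

With these toy values, plain accuracy is 3/4, while the weighted variant is lower because the missed instance carries the largest weight.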

pdf
Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Swaroop Mishra | Daniel Khashabi | Chitta Baral | Hannaneh Hajishirzi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans (e.g., crowdworkers) have a remarkable ability to solve different tasks by simply reading textual instructions that define them and looking at a few examples. Despite the success of conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upper bound, indicating significant room for more progress in this direction.

pdf
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks
Swaroop Mishra | Arindam Mitra | Neeraj Varshney | Bhavdeep Sachdeva | Peter Clark | Chitta Baral | Ashwin Kalyan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill of AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when problems appear in a slightly different scenario. Drawing inspiration from GLUE, which was proposed in the context of natural language understanding, we propose NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that, at their core, require simple arithmetic understanding. We show that this benchmark is far from being solved, with neural models including state-of-the-art large-scale language models performing significantly worse than humans (lower by 46.4%). Further, NumGLUE promotes sharing knowledge across tasks, especially those with limited training data, as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.

pdf
To Find Waldo You Need Contextual Cues: Debiasing Who’s Waldo
Yiran Luo | Pratyay Banerjee | Tejas Gokhale | Yezhou Yang | Chitta Baral
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present a debiased dataset for the Person-centric Visual Grounding (PCVG) task first proposed by Cui et al. (2021) in the Who’s Waldo dataset. Given an image and a caption, PCVG requires pairing up a person’s name mentioned in a caption with a bounding box that points to the person in the image. We find that the original Who’s Waldo dataset compiled for this task contains a large number of biased samples that are solvable simply by heuristic methods; for instance, in many cases the first name in the sentence corresponds to the largest bounding box, or the sequence of names in the sentence corresponds to an exact left-to-right order in the image. Naturally, models trained on these biased data lead to over-estimation of performance on the benchmark. To ensure that models are correct for the right reasons, we design automated tools to filter and debias the original dataset by ruling out all examples of insufficient context, such as those with no verb or with a long chain of conjunct names in their captions. Our experiments show that our new sub-sampled dataset contains less bias, with much lower heuristic performance and wider gaps between heuristic and supervised methods. We also demonstrate that the same benchmark model trained on our debiased training set outperforms the one trained on the original biased (and larger) training set on our debiased test set. We argue that our debiased dataset offers the PCVG task a more practical baseline for reliable benchmarking and future improvements.
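
The two heuristics named in the abstract above (first name to largest box, and caption order to left-to-right order) can be illustrated with a short sketch. The function names and the box layout `(x_min, y_min, x_max, y_max)` are assumptions made for illustration; this is not the authors' code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def left_to_right_heuristic(names: List[str], boxes: List[Box]) -> Dict[str, int]:
    """Assign the k-th name in the caption to the k-th box from the left."""
    # Indices of boxes sorted by their left edge.
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    # zip stops at the shorter sequence, so extra names or boxes are ignored.
    return {name: box_index for name, box_index in zip(names, order)}


def first_name_largest_box(names: List[str], boxes: List[Box]) -> Dict[str, int]:
    """Assign only the first name in the caption to the largest box by area."""
    if not names or not boxes:
        return {}

    def area(box: Box) -> float:
        return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

    largest = max(range(len(boxes)), key=lambda i: area(boxes[i]))
    return {names[0]: largest}


# Toy example (made-up names and coordinates).
caption_names = ["Ann", "Bob"]
person_boxes = [(50.0, 0.0, 80.0, 40.0), (10.0, 0.0, 30.0, 40.0)]
print(left_to_right_heuristic(caption_names, person_boxes))
print(first_name_largest_box(caption_names, person_boxes))
```

Neither function looks at the image content or the caption's meaning, which is why high accuracy from such rules signals dataset bias rather than grounding ability.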

pdf
Towards Improving Selective Prediction Ability of NLP Systems
Neeraj Varshney | Swaroop Mishra | Chitta Baral
Proceedings of the 7th Workshop on Representation Learning for NLP

It’s better to say “I can’t answer” than to answer incorrectly. This selective prediction ability is crucial for NLP systems to be reliably deployed in real-world applications. Prior work has shown that existing selective prediction techniques fail to perform well, especially in the out-of-domain setting. In this work, we propose a method that improves probability estimates of models by calibrating them using prediction confidence and difficulty score of instances. Using these two signals, we first annotate held-out instances and then train a calibrator to predict the likelihood of correctness of the model’s prediction. We instantiate our method with Natural Language Inference (NLI) and Duplicate Detection (DD) tasks and evaluate it in both In-Domain (IID) and Out-of-Domain (OOD) settings. In (IID, OOD) settings, we show that the representations learned by our calibrator result in an improvement of (15.81%, 5.64%) and (6.19%, 13.9%) over ‘MaxProb’ (a selective prediction baseline) on NLI and DD tasks respectively.
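
For readers unfamiliar with the MaxProb baseline referenced above, here is a minimal sketch of selective prediction with it: the model's highest class probability serves as the confidence, and the system abstains when that confidence falls below a threshold. The function names, the threshold value, and the toy probabilities are illustrative assumptions; the calibrator proposed in the paper is not shown.

```python
from typing import List, Optional, Sequence, Tuple


def maxprob_select(probs: Sequence[Sequence[float]], threshold: float) -> List[Optional[int]]:
    """Return the predicted label per instance, or None to abstain."""
    decisions: List[Optional[int]] = []
    for p in probs:
        confidence = max(p)          # MaxProb: highest class probability
        label = list(p).index(confidence)
        decisions.append(label if confidence >= threshold else None)
    return decisions


def coverage_and_risk(decisions: Sequence[Optional[int]], gold: Sequence[int]) -> Tuple[float, float]:
    """Coverage = fraction answered; risk = error rate among answered."""
    answered = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    coverage = len(answered) / len(gold)
    risk = sum(d != g for d, g in answered) / len(answered) if answered else 0.0
    return coverage, risk


# Toy softmax outputs for a binary task (made-up numbers).
toy_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
toy_gold = [0, 1, 1]
toy_decisions = maxprob_select(toy_probs, threshold=0.7)
print(toy_decisions)
print(coverage_and_risk(toy_decisions, toy_gold))
```

Raising the threshold trades coverage for lower risk; a better confidence estimator aims to improve that trade-off.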

pdf
Reframing Instructional Prompts to GPTk’s Language
Swaroop Mishra | Daniel Khashabi | Chitta Baral | Yejin Choi | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: ACL 2022

What kinds of instructional prompts are easier to follow for Language Models (LMs)? We study this question by conducting extensive empirical analysis that sheds light on important features of successful instructional prompts. Specifically, we study several classes of reframing techniques for manual reformulation of prompts into more effective ones. Some examples include decomposing a complex task instruction into multiple simpler tasks or itemizing instructions into sequential steps. Our experiments compare the zero-shot and few-shot performance of LMs prompted with reframed instructions on 12 NLP tasks across 6 categories. Compared with original instructions, our reframed instructions lead to significant improvements across LMs with different sizes. For example, the same reframed prompts boost few-shot performance of GPT3-series and GPT2-series by 12.5% and 6.7% respectively averaged over all tasks. Furthermore, reframed instructions reduce the number of examples required to prompt LMs in the few-shot setting. We hope these empirically-driven techniques will pave the way towards more effective future prompting algorithms.

pdf
Semantically Distributed Robust Optimization for Vision-and-Language Inference
Tejas Gokhale | Abhishek Chaudhary | Pratyay Banerjee | Chitta Baral | Yezhou Yang
Findings of the Association for Computational Linguistics: ACL 2022

Analysis of vision-and-language models has revealed their brittleness under linguistic phenomena such as paraphrasing, negation, textual entailment, and word substitutions with synonyms or antonyms. While data augmentation techniques have been designed to mitigate these failure modes, methods that can integrate this knowledge into the training pipeline remain under-explored. In this paper, we present SDRO, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting, along with an ensembling technique to leverage these transformations during inference. Experiments on benchmark datasets with images (NLVR2) and video (VIOLIN) demonstrate performance improvements as well as robustness to adversarial attacks. Experiments on binary VQA explore the generalizability of this method to other V&L tasks.

pdf
Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings
Neeraj Varshney | Swaroop Mishra | Chitta Baral
Findings of the Association for Computational Linguistics: ACL 2022

In order to equip NLP systems with ‘selective prediction’ capability, several task-specific approaches have been proposed. However, which approaches work best across tasks or even if they consistently outperform the simplest baseline MaxProb remains to be explored. To this end, we systematically study selective prediction in a large-scale setup of 17 datasets across several NLP tasks. Through comprehensive experiments under in-domain (IID), out-of-domain (OOD), and adversarial (ADV) settings, we show that despite leveraging additional resources (held-out data/computation), none of the existing approaches consistently and considerably outperforms MaxProb in all three settings. Furthermore, their performance does not translate well across tasks. For instance, Monte-Carlo Dropout outperforms all other approaches on Duplicate Detection datasets but does not fare well on NLI datasets, especially in the OOD setting. Thus, we recommend that future selective prediction approaches should be evaluated across tasks and settings for reliable estimation of their capabilities.

pdf
Unsupervised Natural Language Inference Using PHL Triplet Generation
Neeraj Varshney | Pratyay Banerjee | Tejas Gokhale | Chitta Baral
Findings of the Association for Computational Linguistics: ACL 2022

Transformer-based models achieve impressive performance on numerous Natural Language Inference (NLI) benchmarks when trained on respective training datasets. However, in certain cases, training samples may not be available or collecting them could be time-consuming and resource-intensive. In this work, we address the above challenge and present an explorative study on unsupervised NLI, a paradigm in which no human-annotated training samples are available. We investigate it under three settings: PH, P, and NPH that differ in the extent of unlabeled data available for learning. As a solution, we propose a procedural data generation approach that leverages a set of sentence transformations to collect PHL (Premise, Hypothesis, Label) triplets for training NLI models, bypassing the need for human-annotated training data. Comprehensive experiments with several NLI datasets show that the proposed approach results in accuracies of up to 66.75%, 65.9%, 65.39% in PH, P, and NPH settings respectively, outperforming all existing unsupervised baselines. Furthermore, fine-tuning our model with as little as ~0.1% of the human-annotated training dataset (500 instances) leads to 12.2% higher accuracy than the model trained from scratch on the same 500 instances. Supported by this superior performance, we conclude with a recommendation for collecting high-quality task-specific data.

pdf
Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness
Tejas Gokhale | Swaroop Mishra | Man Luo | Bhavdeep Sachdeva | Chitta Baral
Findings of the Association for Computational Linguistics: ACL 2022

Data modification, whether via additional training datasets, data augmentation, debiasing, or dataset filtering, has been proposed as an effective solution for generalizing to out-of-domain (OOD) inputs, in both natural language processing and computer vision literature. However, the effect of data modification on adversarial robustness remains unclear. In this work, we conduct a comprehensive study of common data modification strategies and evaluate not only their in-domain and OOD performance, but also their adversarial robustness (AR). We also present results on a two-dimensional synthetic dataset to visualize the effect of each method on the training distribution. This work serves as an empirical study towards understanding the relationship between generalizing to unseen domains and defending against adversarial perturbations. Our findings suggest that more data (either via additional datasets or data augmentation) benefits both OOD accuracy and AR. However, data filtering (previously shown to improve OOD accuracy on natural language inference) hurts OOD accuracy on other tasks such as question answering and image classification. We provide insights from our experiments to inform future work in this direction.

pdf
In-BoXBART: Get Instructions into Biomedical Multi-Task Learning
Mihir Parmar | Swaroop Mishra | Mirali Purohit | Man Luo | Murad Mohammad | Chitta Baral
Findings of the Association for Computational Linguistics: NAACL 2022

Single-task models have proven pivotal in solving specific tasks; however, they have limitations in real-world applications where multi-tasking is necessary and domain shifts are exhibited. Recently, instructional prompts have shown significant improvement towards multi-task generalization; however, the effect of instructional prompts and Multi-Task Learning (MTL) has not been systematically studied in the biomedical domain. Motivated by this, this paper explores the impact of instructional prompts for biomedical MTL. We introduce the BoX, a collection of 32 instruction tasks for Biomedical NLP across (X) various categories. Using this meta-dataset, we propose a unified model termed In-BoXBART, that can jointly learn all tasks of the BoX without any task-specific modules. To the best of our knowledge, this is the first attempt to propose a unified model in the biomedical domain and use instructions to achieve generalization across several biomedical tasks. Experimental results indicate that the proposed model: 1) outperforms the single-task baseline by ~3% and the multi-task (without instruction) baseline by ~18% on average, and 2) shows ~23% improvement compared to the single-task baseline in few-shot learning (i.e., 32 instances per task) on average. Our analysis indicates that there is significant room for improvement across tasks in the BoX, implying scope for future research.

pdf
Let the Model Decide its Curriculum for Multitask Learning
Neeraj Varshney | Swaroop Mishra | Chitta Baral
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Curriculum learning strategies in prior multitask learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching for the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., dataset-level and instance-level, differ in the granularity of arrangement. Through comprehensive experiments with 12 datasets, we show that instance-level and dataset-level techniques result in strong representations as they lead to an average performance improvement of 4.17% and 3.15% over their respective baselines. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks.
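
The difference in granularity between the two classes of techniques can be sketched as follows. This is a hedged illustration only: it assumes difficulty scores are already available, that the curriculum runs from easy to hard, and that a dataset's difficulty is the mean of its instance scores. None of these details are stated in the abstract, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

Instance = Tuple[str, float]  # hypothetical layout: (example id, difficulty score)


def instance_level_curriculum(datasets: Dict[str, List[Instance]]) -> List[str]:
    """Pool all instances and order them individually, easiest first."""
    pooled = [inst for instances in datasets.values() for inst in instances]
    return [example for example, _ in sorted(pooled, key=lambda inst: inst[1])]


def dataset_level_curriculum(datasets: Dict[str, List[Instance]]) -> List[str]:
    """Order whole datasets by mean difficulty; keep instance order within each."""
    ordered_names = sorted(
        datasets, key=lambda name: mean(score for _, score in datasets[name])
    )
    return [example for name in ordered_names for example, _ in datasets[name]]


# Toy data with made-up difficulty scores.
toy = {
    "A": [("a1", 0.2), ("a2", 0.9)],
    "B": [("b1", 0.1), ("b2", 0.3)],
}
print(instance_level_curriculum(toy))
print(dataset_level_curriculum(toy))
```

In the instance-level ordering, examples from different datasets interleave; in the dataset-level ordering, each dataset is presented as a contiguous block.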

pdf
A Simple Approach to Jointly Rank Passages and Select Relevant Sentences in the OBQA Context
Man Luo | Shuguang Chen | Chitta Baral
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

In the open book question answering (OBQA) task, selecting the relevant passages and sentences from distracting information is crucial for reasoning out the answer to a question. The HotpotQA dataset is designed to teach and evaluate systems to do both passage ranking and sentence selection. Many existing frameworks use separate models to select relevant passages and sentences respectively. Such systems not only have high complexity in terms of model parameters but also fail to take advantage of training these two tasks together, since one task can be beneficial for the other. In this work, we present a simple yet effective framework to address these limitations by jointly ranking passages and selecting sentences. Furthermore, we propose consistency and similarity constraints to promote the correlation and interaction between passage ranking and sentence selection. The experiments demonstrate that our framework can achieve competitive results with previous systems and outperform the baseline by 28% in terms of exact matching of relevant sentences on the HotpotQA dataset.

pdf
Choose Your QA Model Wisely: A Systematic Study of Generative and Extractive Readers for Question Answering
Man Luo | Kazuma Hashimoto | Semih Yavuz | Zhiwei Liu | Chitta Baral | Yingbo Zhou
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

While both extractive and generative readers have been successfully applied to the Question Answering (QA) task, little attention has been paid toward the systematic comparison of them. Characterizing the strengths and weaknesses of the two readers is crucial not only for making a more informed reader selection in practice but also for developing a deeper understanding to foster further research on improving readers in a principled manner. Motivated by this goal, we make the first attempt to systematically study the comparison of extractive and generative readers for question answering. To be aligned with the state-of-the-art, we explore nine transformer-based large pre-trained language models (PrLMs) as backbone architectures. Furthermore, we organize our findings under two main categories: (1) keeping the architecture invariant, and (2) varying the underlying PrLMs. Among several interesting findings, it is important to highlight that (1) the generative readers perform better in long context QA, (2) the extractive readers perform better in short context while also showing better out-of-domain generalization, and (3) the encoder of encoder-decoder PrLMs (e.g., T5) turns out to be a strong extractive reader and outperforms the standard choice of encoder-only PrLMs (e.g., RoBERTa). We also study the effect of multi-task learning on the two types of readers varying the underlying PrLMs and perform qualitative and quantitative diagnosis to provide further insights into future directions in modeling better readers.

2021

pdf
WeaQA: Weak Supervision via Captions for Visual Question Answering
Pratyay Banerjee | Tejas Gokhale | Yezhou Yang | Chitta Baral
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
Constructing Flow Graphs from Procedural Cybersecurity Texts
Kuntal Kumar Pal | Kazuaki Kashihara | Pratyay Banerjee | Swaroop Mishra | Ruoyu Wang | Chitta Baral
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
Investigating Numeracy Learning Ability of a Text-to-Text Transfer Model
Kuntal Kumar Pal | Chitta Baral
Findings of the Association for Computational Linguistics: EMNLP 2021

The transformer-based pre-trained language models have been tremendously successful in most of the conventional NLP tasks. But they often struggle in those tasks where numerical understanding is required. Some possible reasons can be the tokenizers and pre-training objectives which are not specifically designed to learn and preserve numeracy. Here we investigate the ability of text-to-text transfer learning model (T5), which has outperformed its predecessors in the conventional NLP tasks, to learn numeracy. We consider four numeracy tasks: numeration, magnitude order prediction, finding minimum and maximum in a series, and sorting. We find that, although T5 models perform reasonably well in the interpolation setting, they struggle considerably in the extrapolation setting across all four tasks.

pdf
‘Just because you are right, doesn’t mean I am wrong’: Overcoming a bottleneck in development and evaluation of Open-Ended VQA tasks
Man Luo | Shailaja Keyur Sampat | Riley Tallman | Yankai Zeng | Manuha Vancha | Akarshan Sajja | Chitta Baral
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

GQA (CITATION) is a dataset for real-world visual reasoning and compositional question answering. We found that many answers predicted by the best vision-language models on the GQA dataset do not match the ground-truth answer but are still semantically meaningful and correct in the given context. In fact, this is the case with most existing visual question answering (VQA) datasets, which assume only one ground-truth answer for each question. To address this limitation, we propose Alternative Answer Sets (AAS) of ground-truth answers, which are created automatically using off-the-shelf NLP tools. We introduce a semantic metric based on AAS and modify top VQA solvers to support multiple plausible answers for a question. We implement this approach on the GQA dataset and show the performance improvements.
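
The idea of scoring against a set of acceptable answers, rather than a single ground truth, can be sketched in a few lines. This is an illustrative simplification that assumes a prediction counts as correct when it appears in the question's answer set after lowercasing; the paper's actual semantic metric and the way the sets are built are not specified here, and the function name is hypothetical.

```python
from typing import List, Set


def answer_set_accuracy(predictions: List[str], answer_sets: List[Set[str]]) -> float:
    """Fraction of predictions found in the corresponding set of accepted answers."""
    if len(predictions) != len(answer_sets):
        raise ValueError("need one answer set per prediction")
    if not predictions:
        raise ValueError("no predictions to score")
    hits = 0
    for prediction, accepted in zip(predictions, answer_sets):
        normalized = {answer.strip().lower() for answer in accepted}
        if prediction.strip().lower() in normalized:
            hits += 1
    return hits / len(predictions)


# Toy example (made up): "sofa" is accepted because the set includes it
# alongside the original answer "couch".
toy_predictions = ["sofa", "dog", "Red"]
toy_answer_sets = [{"couch", "sofa"}, {"cat"}, {"red"}]
print(answer_set_accuracy(toy_predictions, toy_answer_sets))
```

Under exact match against a single answer ("couch"), the first prediction would be marked wrong even though it is correct in context.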

pdf
Self-Supervised Test-Time Learning for Reading Comprehension
Pratyay Banerjee | Tejas Gokhale | Chitta Baral
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent work on unsupervised question answering has shown that models can be trained with procedurally generated question-answer pairs and can achieve performance competitive with supervised methods. In this work, we consider the task of unsupervised reading comprehension and present a method that performs “test-time learning” (TTL) on a given context (text passage), without requiring training on large-scale human-authored datasets containing context-question-answer triplets. This method operates directly on a single test context, uses self-supervision to train models on synthetically generated question-answer pairs, and then infers answers to unseen human-authored questions for this context. Our method achieves accuracies competitive with fully supervised methods and significantly outperforms current unsupervised methods. TTL methods with a smaller model are also competitive with the current state-of-the-art in unsupervised reading comprehension.

pdf
CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images
Shailaja Keyur Sampat | Akshay Kumar | Yezhou Yang | Chitta Baral
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over image-text modality. Our dataset setup scripts and codes will be made publicly available at https://github.com/shailaja183/clevr_hyp.

pdf
Unsupervised Pronoun Resolution via Masked Noun-Phrase Prediction
Ming Shen | Pratyay Banerjee | Chitta Baral
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In this work, we propose Masked Noun-Phrase Prediction (MNPP), a pre-training strategy to tackle pronoun resolution in a fully unsupervised setting. First, we evaluate our pre-trained model on various pronoun resolution datasets without any finetuning. Our method outperforms all previous unsupervised methods on all datasets by large margins. Second, we proceed to a few-shot setting where we finetune our pre-trained model on WinoGrande-S and XS separately. Our method outperforms the RoBERTa-large baseline by large margins, while also achieving a higher AUC score after further finetuning on the remaining three official splits of WinoGrande.

pdf
Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering
Man Luo | Yankai Zeng | Pratyay Banerjee | Chitta Baral
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. One dataset that is widely used in evaluating knowledge-based VQA is OK-VQA, but it lacks a gold standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of varying knowledge bases, it is hard to fairly compare models’ performance. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on given knowledge. We introduce various ways to retrieve knowledge using text and images and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader’s performance on the OK-VQA challenge. The code and corpus are provided at https://github.com/luomancs/retriever_reader_for_okvqa.git.

2020

pdf
Deeply Embedded Knowledge Representation & Reasoning For Natural Language Question Answering: A Practitioner’s Perspective
Arindam Mitra | Sanjay Narayana | Chitta Baral
Proceedings of the Fourth Workshop on Structured Prediction for NLP

Successful application of Knowledge Representation and Reasoning (KR) in Natural Language Understanding (NLU) is largely limited by the availability of a robust and general purpose natural language parser. Even though several projects have been launched in the pursuit of developing a universal meaning representation language, the existence of an accurate universal parser is far from reality. This has severely limited the application of KR in the field of NLP and also prevented a proper evaluation of KR based NLU systems. Our goal is to build KR based systems for Natural Language Understanding without relying on a parser. Towards this we propose a method named Deeply Embedded Knowledge Representation & Reasoning (DeepEKR) where we replace the parser by a neural network, soften the symbolic representation so that a deterministic mapping exists between the parser neural network and the interpretable logical form, and finally replace the symbolic solver by an equivalent neural network, so the model can be trained end-to-end. We evaluate our method with respect to the task of Qualitative Word Problem Solving on the two available datasets (QuaRTz and QuaRel). Our system matches the state-of-the-art accuracy on QuaRTz, outperforms the state-of-the-art on QuaRel, and substantially outperforms a traditional KR based system. The results show that the bias introduced by a KR solution does not prevent it from doing a better job at the end task. Moreover, our method is interpretable due to the bias introduced by the KR approach.

pdf
Self-Supervised Knowledge Triplet Learning for Zero-Shot Question Answering
Pratyay Banerjee | Chitta Baral
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The aim of all Question Answering (QA) systems is to generalize to unseen questions. Current supervised methods are reliant on expensive data annotation. Moreover, such annotations can introduce unintended annotator bias, making systems focus more on the bias than the actual task. This work proposes Knowledge Triplet Learning (KTL), a self-supervised task over knowledge graphs. We propose heuristics to create synthetic graphs for commonsense and scientific knowledge. We propose using KTL to perform zero-shot question answering, and our experiments show considerable improvements over large pre-trained transformer language models.

pdf
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
Zhiyuan Fang | Tejas Gokhale | Pratyay Banerjee | Chitta Baral | Yezhou Yang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent’s actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset “Video-to-Commonsense (V2C)” that contains ~9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.

pdf
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
Tejas Gokhale | Pratyay Banerjee | Chitta Baral | Yezhou Yang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present MUTANT, a training paradigm that exposes the model to perceptually similar, yet semantically distinct mutations of the input, to improve OOD generalization, such as on the VQA-CP challenge. Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in input (question-image pair) on the output (answer). Unlike existing methods on VQA-CP, MUTANT does not rely on knowledge about the nature of train and test answer distributions. MUTANT establishes a new state-of-the-art accuracy on VQA-CP with a 10.57% improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.

pdf
Visuo-Linguistic Question Answering (VLQA) Challenge
Shailaja Keyur Sampat | Yezhou Yang | Chitta Baral
Findings of the Association for Computational Linguistics: EMNLP 2020

Understanding images and text together is an important aspect of cognition and building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task to derive joint inference about a given image-text modality and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first explore the best existing vision-language architectures to solve VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it is still far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code and leaderboard are available at https://shailaja183.github.io/vlqa/.

2019

pdf
Combining Knowledge Hunting and Neural Language Models to Solve the Winograd Schema Challenge
Ashok Prakash | Arpit Sharma | Arindam Mitra | Chitta Baral
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The Winograd Schema Challenge (WSC) is a pronoun resolution task which seems to require reasoning with commonsense knowledge. The needed knowledge is not present in the given text. Automatic extraction of the needed knowledge is a bottleneck in solving the challenge. The existing state-of-the-art approach uses the knowledge embedded in its pre-trained language model. However, language models embed only part of the knowledge, namely the part related to frequently co-existing concepts. This limits the performance of such models on the WSC problems. In this work, we build on the language model based methods and augment them with a commonsense knowledge hunting (using automatic extraction from text) module and an explicit reasoning module. Our end-to-end system built in such a manner improves on the accuracy of two of the available language model based approaches by 5.53% and 7.7% respectively. Overall our system achieves the state-of-the-art accuracy of 71.06% on the WSC dataset, an improvement of 7.36% over the previous best.

pdf
Careful Selection of Knowledge to Solve Open Book Question Answering
Pratyay Banerjee | Kuntal Kumar Pal | Arindam Mitra | Chitta Baral
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Open book question answering is a type of natural language based QA (NLQA) where questions are expected to be answered with respect to a given set of open book facts, and common knowledge about a topic. Recently a challenge involving such QA, OpenBookQA, has been proposed. Unlike most other NLQA that focus on linguistic understanding, OpenBookQA requires deeper reasoning involving linguistic understanding as well as reasoning with common knowledge. In this paper we address QA with respect to the OpenBookQA dataset and combine state of the art language models with abductive information retrieval (IR), information gain based re-ranking, passage selection and weighted scoring to achieve 72.0% accuracy, an 11.6% improvement over the current state of the art.

pdf
Identification of Adverse Drug Reaction Mentions in Tweets – SMM4H Shared Task 2019
Samarth Rawal | Siddharth Rawal | Saadat Anwar | Chitta Baral
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

Analyzing social media posts can offer insights into a wide range of topics that are commonly discussed online, providing valuable information for studying various health-related phenomena reported online. The outcome of this work can offer insights into pharmacovigilance research to monitor the adverse effects of medications. This research specifically looks into mentions of adverse drug reactions (ADRs) in Twitter data through the Social Media Mining for Health Applications (SMM4H) Shared Task 2019. Adverse drug reactions are undesired harmful effects which can arise from medication or other methods of treatment. The goal of this research is to build accurate models using natural language processing techniques to detect reports of adverse drug reactions in Twitter data and extract these words or phrases.

2016

pdf
Learning To Use Formulas To Solve Simple Arithmetic Problems
Arindam Mitra | Chitta Baral
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf
Learning to Automatically Solve Logic Grid Puzzles
Arindam Mitra | Chitta Baral
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Recognizing Social Constructs from Textual Conversation
Somak Aditya | Chitta Baral | Nguyen Ha Vo | Joohyung Lee | Jieping Ye | Zaw Naung | Barry Lumpkin | Jenny Hastings | Richard Scherl | Dawn M. Sweet | Daniela Inclezan
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
The NL2KR Platform for building Natural Language Translation Systems
Nguyen Vo | Arindam Mitra | Chitta Baral
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
Identifying Various Kinds of Event Mentions in K-Parser Output
Arpit Sharma | Nguyen Vo | Somak Aditya | Chitta Baral
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

2011

pdf
Using Inverse lambda and Generalization to Translate English to Formal Languages
Chitta Baral | Juraj Dzifcak | Marcos Alvarez Gonzalez | Jiayu Zhou
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

2009

pdf
Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text
Siddhartha Jonnalagadda | Luis Tari | Jörg Hakenberg | Chitta Baral | Graciela Gonzalez
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2005

pdf
IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text
Syed Toufeeq Ahmed | Deepthi Chidambaram | Hasan Davulcu | Chitta Baral
Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics

2004

pdf
Using answer set programming to answer complex queries
Chitta Baral | Michael Gelfond | Richard Scherl
Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004