Devi Parikh


VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator
Ayush Shrivastava | Karthik Gopalakrishnan | Yang Liu | Robinson Piramuthu | Gokhan Tur | Devi Parikh | Dilek Hakkani-Tur
Findings of the Association for Computational Linguistics: ACL 2022

Interactive robots navigating photo-realistic environments need to be trained to effectively leverage and handle the dynamic nature of dialogue in addition to the challenges underlying vision-and-language navigation (VLN). In this paper, we present VISITRON, a multi-modal Transformer-based navigator better suited to the interactive regime inherent to Cooperative Vision-and-Dialog Navigation (CVDN). VISITRON is trained to: i) identify and associate object-level concepts and semantics between the environment and dialogue history, and ii) identify when to interact vs. navigate via imitation learning of a binary classification head. We perform extensive pre-training and fine-tuning ablations with VISITRON to gain empirical insights and improve performance on CVDN. VISITRON’s ability to identify when to interact leads to a natural generalization of the game-play mode introduced by Roman et al. (2020) for enabling the use of such models in different environments. VISITRON is competitive with models on the static CVDN leaderboard and attains state-of-the-art performance on the Success weighted by Path Length (SPL) metric.
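For readers unfamiliar with the SPL metric named in the abstract above, the following is a minimal sketch of its commonly used definition: each episode contributes its success indicator scaled by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, and the result is averaged over episodes. The function name `spl` and its argument names are illustrative assumptions, not taken from the VISITRON code, and the handling of zero-length episodes is an assumption as well.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Hypothetical helper for illustration; names are not from the paper.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        if denom > 0:
            # A successful episode scores shortest / max(taken, shortest);
            # a failed episode scores 0.
            total += float(success) * shortest / denom
        else:
            # Assumption: a zero-length episode counts as its raw success value.
            total += float(success)
    return total / len(successes)


if __name__ == "__main__":
    # Three episodes: optimal success, success with a 2x detour, failure.
    print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0]))
```

A path twice as long as the shortest one halves that episode's contribution, so the metric rewards agents that succeed efficiently rather than merely succeed.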


SOrT-ing VQA Models : Contrastive Gradient Learning for Improved Consistency
Sameer Dharur | Purva Tendulkar | Dhruv Batra | Devi Parikh | Ramprasaath R. Selvaraju
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent research in Visual Question Answering (VQA) has revealed state-of-the-art models to be inconsistent in their understanding of the world: they answer seemingly difficult questions requiring reasoning correctly but get simpler associated sub-questions wrong. These sub-questions pertain to lower-level visual concepts in the image that models ideally should understand to be able to answer the reasoning question correctly. To address this, we first present a gradient-based interpretability approach to determine the questions most strongly correlated with the reasoning question on an image, and use this to evaluate VQA models on their ability to identify the relevant sub-questions needed to answer a reasoning question. Next, we propose a contrastive gradient learning based approach called Sub-question Oriented Tuning (SOrT) which encourages models to rank relevant sub-questions higher than irrelevant questions for an <image, reasoning-question> pair. We show that SOrT improves model consistency by up to 6.5 percentage points over existing approaches, while also improving visual grounding and robustness to rephrasings of questions.


Where Are You? Localization from Embodied Dialog
Meera Hahn | Jacob Krantz | Dhruv Batra | Devi Parikh | James Rehg | Stefan Lee | Peter Anderson
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present WHERE ARE YOU? (WAY), a dataset of ~6k dialogs in which two humans – an Observer and a Locator – complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task – providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer’s location within 3m in unseen buildings, vs. 70.4% for human Locators.
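The "success within 3m" figure reported above is a thresholded accuracy over per-episode localization errors. A minimal sketch of that computation follows, assuming the errors have already been measured in meters; how the distance itself is measured is not specified here, and the function name `success_at_threshold` as well as the inclusive `<=` comparison are illustrative assumptions.

```python
from typing import Sequence


def success_at_threshold(errors_m: Sequence[float], threshold_m: float = 3.0) -> float:
    """Fraction of episodes whose localization error is within the threshold.

    Hypothetical helper for illustration; an inclusive threshold is assumed.
    """
    if len(errors_m) == 0:
        raise ValueError("at least one episode is required")
    hits = sum(1 for err in errors_m if err <= threshold_m)
    return hits / len(errors_m)


if __name__ == "__main__":
    # Four episodes with errors in meters; three fall within 3 m.
    print(success_at_threshold([1.0, 2.9, 3.0, 7.5]))
```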


CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Satwik Kottur | José M. F. Moura | Devi Parikh | Dhruv Batra | Marcus Rohrbach
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image (using the conversation history as context). It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively expensive complete annotation of the ‘state’ of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models, and it was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our code and dataset are publicly available.

CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication
Jin-Hwa Kim | Nikita Kitaev | Xinlei Chen | Marcus Rohrbach | Byoung-Tak Zhang | Yuandong Tian | Dhruv Batra | Devi Parikh
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel “crosstalk” evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

Improving Generative Visual Dialog by Answering Diverse Questions
Vishvak Murahari | Prithvijit Chattopadhyay | Dhruv Batra | Devi Parikh | Abhishek Das
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al., ICCV 2017) has explored a Q-Bot-A-Bot image-guessing game and shown that this ‘self-talk’ approach can lead to improved performance at the downstream dialog-conditioned image-guessing task. However, this improvement saturates and starts degrading after a few rounds of interaction, and does not lead to a better Visual Dialog model. We find that this is due in part to repeated interactions between Q-Bot and A-Bot during self-talk, which are not informative with respect to the image. To improve this, we devise a simple auxiliary objective that incentivizes Q-Bot to ask diverse questions, thus reducing repetitions and in turn enabling A-Bot to explore a larger state space during RL, i.e., be exposed to more visual concepts to talk about and more varied questions to answer. We evaluate our approach via a host of automatic metrics and human studies, and demonstrate that it leads to better dialog, i.e., dialog that is more diverse (i.e., less repetitive), consistent (i.e., has fewer conflicting exchanges), fluent (i.e., more human-like), and detailed, while remaining comparably image-relevant to prior work and ablations.


Do explanations make VQA models more predictable to a human?
Arjun Chandrasekaran | Viraj Prabhu | Deshraj Yadav | Prithvijit Chattopadhyay | Devi Parikh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

A rich line of research attempts to make deep neural networks more transparent by generating human-interpretable ‘explanations’ of their decision process, especially for interactive tasks like Visual Question Answering (VQA). In this work, we analyze if existing explanations indeed make a VQA model — its responses as well as failures — more predictable to a human. Surprisingly, we find that they do not. On the other hand, we find that human-in-the-loop approaches that treat the model as a black-box do.

Punny Captions: Witty Wordplay in Image Descriptions
Arjun Chandrasekaran | Devi Parikh | Mohit Bansal
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Wit is a form of rich interaction that is often grounded in a specific situation (e.g., a comment in response to an event). In this work, we attempt to build computational models that can produce witty descriptions for a given image. Inspired by a cognitive account of humor appreciation, we employ linguistic wordplay, specifically puns, in image descriptions. We develop two approaches which involve retrieving witty descriptions for a given image from a large corpus of sentences, or generating them via an encoder-decoder neural network architecture. We compare our approach against meaningful baseline approaches via human studies and show substantial improvements. Moreover, in a Turing test style evaluation, people find the image descriptions generated by our model to be slightly wittier than human-written witty descriptions when the human is subject to similar constraints as the model regarding word usage and style.


Sound-Word2Vec: Learning Word Representations Grounded in Sounds
Ashwin Vijayakumar | Ramakrishna Vedantam | Devi Parikh
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

To be able to interact better with humans, it is crucial for machines to understand sound – a primary modality of human perception. Previous works have used sound to learn embeddings for improved generic semantic similarity assessment. In this work, we treat sound as a first-class citizen, studying downstream textual tasks which require aural grounding. To this end, we propose sound-word2vec – a new embedding scheme that learns specialized word embeddings grounded in sounds. For example, we learn that two seemingly (semantically) unrelated concepts, like leaves and paper, are similar due to the similar rustling sounds they make. Our embeddings prove useful in textual tasks requiring aural reasoning like text-based sound retrieval and discovering Foley sound effects (used in movies). Moreover, our embedding space captures interesting dependencies between words and onomatopoeia and outperforms prior work on aurally-relevant word relatedness datasets such as AMEN and ASLex.

Deal or No Deal? End-to-End Learning of Negotiation Dialogues
Mike Lewis | Denis Yarats | Yann Dauphin | Devi Parikh | Dhruv Batra
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Much of human dialogue occurs in semi-cooperative settings, where agents with different goals attempt to agree on common decisions. Negotiations require complex communication and reasoning skills, but success is easy to measure, making this an interesting task for AI. We gather a large dataset of human-human negotiations on a multi-issue bargaining task, where agents who cannot observe each other’s reward functions must reach an agreement (or a deal) via natural language dialogue. For the first time, we show it is possible to train end-to-end models for negotiation, which must learn both linguistic and reasoning skills with no annotated dialogue states. We also introduce dialogue rollouts, in which the model plans ahead by simulating possible complete continuations of the conversation, and find that this technique dramatically improves performance. Our code and dataset are publicly available.

ParlAI: A Dialog Research Software Platform
Alexander Miller | Will Feng | Dhruv Batra | Antoine Bordes | Adam Fisch | Jiasen Lu | Devi Parikh | Jason Weston
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce ParlAI (pronounced “par-lay”), an open-source software platform for dialog research implemented in Python. Its goal is to provide a unified framework for sharing, training and testing dialog models; integration of Amazon Mechanical Turk for data collection, human evaluation, and online/reinforcement learning; and a repository of machine learning models for comparing with others’ models, and improving upon existing architectures. Over 20 tasks are supported in the first release, including popular datasets such as SQuAD, bAbI tasks, MCTest, WikiQA, QACNN, QADailyMail, CBT, bAbI Dialog, Ubuntu, OpenSubtitles and VQA. Several models are integrated, including neural models such as memory networks, seq2seq and attentive LSTMs.


Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions
Arijit Ray | Gordon Christie | Mohit Bansal | Dhruv Batra | Devi Parikh
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Sort Story: Sorting Jumbled Images and Captions into Stories
Harsh Agrawal | Arjun Chandrasekaran | Dhruv Batra | Devi Parikh | Mohit Bansal
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?
Abhishek Das | Harsh Agrawal | Larry Zitnick | Devi Parikh | Dhruv Batra
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Analyzing the Behavior of Visual Question Answering Models
Aishwarya Agrawal | Dhruv Batra | Devi Parikh
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
Nasrin Mostafazadeh | Nathanael Chambers | Xiaodong He | Devi Parikh | Dhruv Batra | Lucy Vanderwende | Pushmeet Kohli | James Allen
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Visual Storytelling
Ting-Hao Kenneth Huang | Francis Ferraro | Nasrin Mostafazadeh | Ishan Misra | Aishwarya Agrawal | Jacob Devlin | Ross Girshick | Xiaodong He | Pushmeet Kohli | Dhruv Batra | C. Lawrence Zitnick | Devi Parikh | Lucy Vanderwende | Michel Galley | Margaret Mitchell
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies