Satwik Kottur


2022

pdf
Database Search Results Disambiguation for Task-Oriented Dialog Systems
Kun Qian | Satwik Kottur | Ahmad Beirami | Shahin Shayandeh | Paul Crook | Alborz Geramifard | Zhou Yu | Chinnadhurai Sankar
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

As task-oriented dialog systems are becoming increasingly popular in our lives, more realistic tasks have been proposed and explored. However, new practical challenges arise. For instance, current dialog systems cannot effectively handle multiplesearch results when querying a database, due to the lack of such scenarios in existing public datasets. In this paper, we propose Database Search Result (DSR) Disambiguation, a novel task that focuses on disambiguating database search results, which enhances user experience by allowing them to choose from multiple options instead of just one. To study this task, we augment the popular task-oriented dialog datasets (MultiWOZ and SGD) with turns that resolve ambiguities by (a) synthetically generating turns through a pre-defined grammar, and (b) collecting human paraphrases for a subset. We find that training on our augmented dialog data improves the model’s ability to deal with ambiguous scenarios, without sacrificing performance on unmodified turns. Furthermore, pre-fine tuning and multi-task learning help our model to improve performance on DSR-disambiguation even in the absence of in-domain data, suggesting that it can be learned as a universal dialog skill. Our data and code will be made publicly available.

2021

pdf
DialogStitch: Synthetic Deeper and Multi-Context Task-Oriented Dialogs
Satwik Kottur | Chinnadhurai Sankar | Zhou Yu | Alborz Geramifard
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Real-world conversational agents must effectively handle long conversations that span multiple contexts. Such context can be interspersed with chitchat (dialog turns not directly related to the task at hand), and potentially grounded in a multimodal setting. While prior work focused on the above aspects in isolation, there is a lack of a unified framework that studies them together. To overcome this, we propose DialogStitch, a novel framework to seamlessly ‘stitch’ multiple conversations and highlight these desirable traits in a taskoriented dialog. After stitching, our dialogs are provably deeper, contain longer-term dependencies, and span multiple contexts, when compared with the source dialogs—all free of cost without any additional annotations! Though our framework generalizes to a variety of combinations, we demonstrate its benefits in two settings: (a) multimodal, imagegrounded conversations, and, (b) task-oriented dialogs fused with chit-chat conversations. We benchmark state-of-the-art dialog models on our datasets and find accuracy drops of (a) 12% and (b) 45% respectively, indicating the additional challenges in the stitched dialogs. Our code and data are publicly available.

pdf
An Analysis of State-of-the-Art Models for Situated Interactive MultiModal Conversations (SIMMC)
Satwik Kottur | Paul Crook | Seungwhan Moon | Ahmad Beirami | Eunjoon Cho | Rajen Subba | Alborz Geramifard
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

There is a growing interest in virtual assistants with multimodal capabilities, e.g., inferring the context of a conversation through scene understanding. The recently released situated and interactive multimodal conversations (SIMMC) dataset addresses this trend by enabling research to create virtual assistants, which are capable of taking into account the scene that user sees when conversing with the user and also interacting with items in the scene. The SIMMC dataset is novel in that it contains fully annotated user-assistant, task-orientated dialogs where the user and an assistant co-observe the same visual elements and the latter can take actions to update the scene. The SIMMC challenge, held as part of theNinth Dialog System Technology Challenge(DSTC9), propelled the development of various models which together set a new state-of-the-art on the SIMMC dataset. In this work, we compare and analyze these models to identify‘what worked?’, and the remaining gaps;‘whatnext?’. Our analysis shows that even though pretrained language models adapted to this set-ting show great promise, there are indications that multimodal context isn’t fully utilised, and there is a need for better and scalable knowledge base integration. We hope this first-of-its-kind analysis for SIMMC models provides useful insights and opportunities for further research in multimodal conversational agents

pdf
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
Hung Le | Chinnadhurai Sankar | Seungwhan Moon | Ahmad Beirami | Alborz Geramifard | Satwik Kottur
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogue. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset are publicly available.

pdf
SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations
Satwik Kottur | Seungwhan Moon | Alborz Geramifard | Babak Damavandi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the real-world multimodal environment. Existing task-oriented dialog datasets aimed towards virtual assistance fall short and do not situate the dialog in the user’s multimodal context. To overcome, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes. The dialogs are collection using a two-phase pipeline: (1) A novel multimodal dialog simulator generates simulated dialog flows, with an emphasis on diversity and richness of interactions, (2) Manual paraphrasing of generating utterances to draw from natural language distribution. We provide an in-depth analysis of the collected dataset, and describe in detail the four main benchmark tasks we propose for SIMMC 2.0. Our baseline model, powered by the state-of-the-art language model, shows promising results, and highlights new challenges and directions for the community to study.

2020

pdf
Situated and Interactive Multimodal Conversations
Seungwhan Moon | Satwik Kottur | Paul Crook | Ankita De | Shivani Poddar | Theodore Levin | David Whitney | Daniel Difranco | Ahmad Beirami | Eunjoon Cho | Rajen Subba | Alborz Geramifard
Proceedings of the 28th International Conference on Computational Linguistics

Next generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, and the user’s utterances), and perform multimodal actions (, displaying a route while generating the system’s utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ~13K human-human dialogs (~169K utterances) collected using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture – grounded in a shared virtual environment; and (b) fashion – grounded in an evolving set of images. Datasets include multimodal context of the items appearing in each scene, and contextual NLU, NLG and coreference annotations using a novel and unified framework of SIMMC conversational acts for both user and assistant utterances. Finally, we present several tasks within SIMMC as objective evaluation protocols, such as structural API prediction, response generation, and dialog state tracking. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, and models will be made publicly available.

2019

pdf
CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
Satwik Kottur | José M. F. Moura | Devi Parikh | Dhruv Batra | Marcus Rohrbach
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image (using the conversation history as context). It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible as it requires prohibitively-expensive complete annotation of the ‘state’ of all images and dialogs. We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling to 4.25M question-answer pairs. We use CLEVR-Dialog to benchmark performance of standard visual dialog models; in particular, on visual coreference resolution (as a function of the coreference distance). This is the first analysis of its kind for visual dialog models that was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our code and dataset are publicly available.

2017

pdf
Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog
Satwik Kottur | José Moura | Stefan Lee | Dhruv Batra
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

A number of recent works have proposed techniques for end-to-end learning of communication protocols among cooperative multi-agent populations, and have simultaneously found the emergence of grounded human-interpretable language in the protocols developed by the agents, learned without any human supervision! In this paper, using a Task & Talk reference game between two agents as a testbed, we present a sequence of ‘negative’ results culminating in a ‘positive’ one – showing that while most agent-invented languages are effective (i.e. achieve near-perfect task rewards), they are decidedly not interpretable or compositional. In essence, we find that natural language does not emerge ‘naturally’,despite the semblance of ease of natural-language-emergence that one may gather from recent literature. We discuss how it is possible to coax the invented languages to become more and more human-like and compositional by increasing restrictions on how two agents may communicate.