Abigail See


Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent
Ethan A. Chi | Ashwin Paranjape | Abigail See | Caleb Chiam | Trenton Chang | Kathleen Kenealy | Swee Kiat Lim | Amelia Hardy | Chetanya Rastogi | Haojun Li | Alexander Iyabor | Yutong He | Hari Sowrirajan | Peng Qi | Kaushik Ram Sadagopan | Nguyet Minh Phu | Dilara Soylu | Jillian Tang | Avanika Narayan | Giovanni Campagna | Christopher Manning
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We present Chirpy Cardinal, an open-domain social chatbot. Aiming to be both informative and conversational, our bot chats with users in an authentic, emotionally intelligent way. By integrating controlled neural generation with scaffolded, hand-written dialogue, we let both the user and bot take turns driving the conversation, producing an engaging and socially fluent experience. Deployed in the fourth iteration of the Alexa Prize Socialbot Grand Challenge, Chirpy Cardinal handled thousands of conversations per day, placing second out of nine bots with an average user rating of 3.58/5.


Understanding and predicting user dissatisfaction in a neural generative chatbot
Abigail See | Christopher Manning
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Neural generative dialogue agents have shown an increasing ability to hold short chitchat conversations when evaluated by crowdworkers in controlled settings. However, their performance in real-life deployment – talking to intrinsically-motivated users in noisy environments – is less well-explored. In this paper, we perform a detailed case study of a neural generative model deployed as part of Chirpy Cardinal, an Alexa Prize socialbot. We find that unclear user utterances are a major source of generative errors such as ignoring, hallucination, unclearness and repetition. However, even in unambiguous contexts the model frequently makes reasoning errors. Though users express dissatisfaction in correlation with these errors, certain dissatisfaction types (such as offensiveness and privacy objections) depend on additional factors, such as the user’s personal attitudes and prior unaddressed dissatisfaction in the conversation. Finally, we show that dissatisfied user utterances can be used as a semi-supervised learning signal to improve the dialogue system. We train a model to predict next-turn dissatisfaction, and show through human evaluation that as a ranking function, it selects higher-quality neural-generated utterances.


What makes a good conversation? How controllable attributes affect human judgments
Abigail See | Stephen Roller | Douwe Kiela | Jason Weston
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

A good conversation requires balance – between simplicity and detail; staying on topic and changing it; asking questions and answering them. Although dialogue agents are commonly evaluated via human judgments of overall quality, the relationship between quality and these individual factors is less well-studied. In this work, we examine two controllable neural text generation methods, conditional training and weighted decoding, in order to control four important attributes for chit-chat dialogue: repetition, specificity, response-relatedness and question-asking. We conduct a large-scale human evaluation to measure the effect of these control parameters on multi-turn interactive conversations on the PersonaChat task. We provide a detailed analysis of their relationship to high-level aspects of conversation, and show that by controlling combinations of these variables our models obtain clear improvements in human quality judgments.


Do Massively Pretrained Language Models Make Better Storytellers?
Abigail See | Aneesh Pappu | Rohun Saxena | Akhila Yerukola | Christopher D. Manning
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Large neural language models trained on massive amounts of text have emerged as a formidable strategy for Natural Language Understanding tasks. However, the strength of these models as Natural Language Generators is less clear. Though anecdotal evidence suggests that these models generate better quality text, there has been no detailed study characterizing their generation abilities. In this work, we compare the performance of an extensively pretrained model, OpenAI GPT2-117 (Radford et al., 2019), to a state-of-the-art neural story generation model (Fan et al., 2018). By evaluating the generated text across a wide variety of automatic metrics, we characterize the ways in which pretrained models do, and do not, make better storytellers. We find that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.


Get To The Point: Summarization with Pointer-Generator Networks
Abigail See | Peter J. Liu | Christopher D. Manning
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.


Compression of Neural Machine Translation Models via Pruning
Abigail See | Minh-Thang Luong | Christopher D. Manning
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning