Elizabeth Clark


2022

pdf
Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)
Elizabeth Clark | Faeze Brahman | Mohit Iyyer
Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)

2021

pdf
Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models
Elizabeth Clark | Noah A. Smith
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Story generation is an open-ended and subjective task, which poses a challenge for evaluating story generation models. We present Choose Your Own Adventure, a collaborative writing setup for pairwise model evaluation. Two models generate suggestions to people as they write a short story; we ask writers to choose one of the two suggestions, and we observe which model’s suggestions they prefer. The setup also allows further analysis based on the revisions people make to the suggestions. We show that these measures, combined with automatic metrics, provide an informative picture of the models’ performance, both in cases where the differences in generation methods are small (nucleus vs. top-k sampling) and large (GPT2 vs. Fusion models).

pdf
TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Rowan Zellers | Ari Holtzman | Elizabeth Clark | Lianhui Qin | Ali Farhadi | Yejin Choi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

pdf
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark | Tal August | Sofia Serrano | Nikita Haduong | Suchin Gururangan | Noah A. Smith
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Human evaluations are typically considered the gold standard in natural language generation, but as models’ fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts’ ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators’ accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.

pdf
Proceedings of the Third Workshop on Narrative Understanding
Nader Akoury | Faeze Brahman | Snigdha Chaturvedi | Elizabeth Clark | Mohit Iyyer | Lara J. Martin
Proceedings of the Third Workshop on Narrative Understanding

2020

pdf
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events
Claire Bonial | Tommaso Caselli | Snigdha Chaturvedi | Elizabeth Clark | Ruihong Huang | Mohit Iyyer | Alejandro Jaimes | Heng Ji | Lara J. Martin | Ben Miller | Teruko Mitamura | Nanyun Peng | Joel Tetreault
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

pdf
Exploring the Effect of Author and Reader Identity in Online Story Writing: the STORIESINTHEWILD Corpus.
Tal August | Maarten Sap | Elizabeth Clark | Katharina Reinecke | Noah A. Smith
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

Current story writing or story editing systems rely on human judgments of story quality for evaluating performance, often ignoring the subjectivity in ratings. We analyze the effect of author and reader characteristics and story writing setup on the quality of stories in a short storytelling task. To study this effect, we create and release STORIESINTHEWILD, containing 1,630 stories collected on a volunteer-based crowdsourcing platform. Each story is rated by three different readers, and comes paired with the author’s and reader’s age, gender, and personality. Our findings show significant effects of authors’ and readers’ identities, as well as writing setup, on story writing and ratings. Notably, compared to younger readers, readers age 45 and older consider stories significantly less creative and less entertaining. Readers also prefer stories written all at once, rather than in chunks, finding them more coherent and creative. We also observe linguistic differences associated with authors’ demographics (e.g., older authors wrote more vivid and emotional stories). Our findings suggest that reader and writer demographics, as well as writing setup, should be accounted for in story writing evaluations.

2019

pdf
Counterfactual Story Reasoning and Generation
Lianhui Qin | Antoine Bosselut | Ari Holtzman | Chandra Bhagavatula | Elizabeth Clark | Yejin Choi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Counterfactual reasoning requires predicting how alternative events, contrary to what actually happened, might have resulted in different outcomes. Despite being considered a necessary component of AI-complete systems, few resources have been developed for evaluating counterfactual reasoning in narratives. In this paper, we propose Counterfactual Story Rewriting: given an original story and an intervening counterfactual event, the task is to minimally revise the story to make it compatible with the given counterfactual event. Solving this task will require deep understanding of causal narrative chains and counterfactual invariance, and integration of such story reasoning capabilities into conditional language generation models. We present TIMETRAVEL, a new dataset of 29,849 counterfactual rewritings, each with the original story, a counterfactual event, and human-generated revision of the original story compatible with the counterfactual event. Additionally, we include 81,407 counterfactual “branches” without a rewritten storyline to support future work on semi- or un-supervised approaches to counterfactual story rewriting. Finally, we evaluate the counterfactual rewriting capacities of several competitive baselines based on pretrained language models, and assess whether common overlap and model-based automatic metrics for text generation correlate well with human scores for counterfactual rewriting.

pdf
Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts
Elizabeth Clark | Asli Celikyilmaz | Noah A. Smith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover’s similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5). We also show that sentence mover’s similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.

pdf
Proceedings of the First Workshop on Narrative Understanding
David Bamman | Snigdha Chaturvedi | Elizabeth Clark | Madalina Fiterau | Mohit Iyyer
Proceedings of the First Workshop on Narrative Understanding

2018

pdf
Neural Text Generation in Stories Using Entity Representations as Context
Elizabeth Clark | Yangfeng Ji | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We introduce an approach to neural text generation that explicitly represents entities mentioned in the text. Entity representations are vectors that are updated as the text proceeds; they are designed specifically for narrative text like fiction or news stories. Our experiments demonstrate that modeling entities offers a benefit in two automatic evaluations: mention generation (in which a model chooses which entity to mention next and which words to use in the mention) and selection between a correct next sentence and a distractor from later in the same story. We also conduct a human evaluation on automatically generated text in story contexts; this study supports our emphasis on entities and suggests directions for further research.

pdf
Sounding Board: A User-Centric and Content-Driven Social Chatbot
Hao Fang | Hao Cheng | Maarten Sap | Elizabeth Clark | Ari Holtzman | Yejin Choi | Noah A. Smith | Mari Ostendorf
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present Sounding Board, a social chatbot that won the 2017 Amazon Alexa Prize. The system architecture consists of several components including spoken language processing, dialogue management, language generation, and content management, with emphasis on user-centric and content-driven design. We also share insights gained from large-scale online logs based on 160,000 conversations with real-world users.