Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

Claire Bonial, Tommaso Caselli, Snigdha Chaturvedi, Elizabeth Clark, Ruihong Huang, Mohit Iyyer, Alejandro Jaimes, Heng Ji, Lara J. Martin, Ben Miller, Teruko Mitamura, Nanyun Peng, Joel Tetreault (Editors)

Anthology ID:
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events
Claire Bonial | Tommaso Caselli | Snigdha Chaturvedi | Elizabeth Clark | Ruihong Huang | Mohit Iyyer | Alejandro Jaimes | Heng Ji | Lara J. Martin | Ben Miller | Teruko Mitamura | Nanyun Peng | Joel Tetreault

pdf bib
New Insights into Cross-Document Event Coreference: Systematic Comparison and a Simplified Approach
Andres Cremisini | Mark Finlayson

Cross-Document Event Coreference (CDEC) is the task of finding coreference relationships between events in separate documents, most commonly assessed using the Event Coreference Bank+ corpus (ECB+). At least two different approaches have been proposed for CDEC on ECB+ that use only event triggers, and at least four have been proposed that use both triggers and entities. Comparing these approaches is complicated by variation in the systems’ use of gold vs. computed labels, as well as variation in the document clustering pre-processing step. We present an approach that matches or slightly beats state-of-the-art performance on CDEC over ECB+ with only event trigger annotations, but with a significantly simpler framework and much smaller feature set relative to prior work. This study allows us to directly compare with prior systems and draw conclusions about the effectiveness of various strategies. Additionally, we provide the first cross-validated evaluation on the ECB+ dataset; the first explicit evaluation of the pairwise event coreference classification step; and the first quantification of the effect of document clustering on system performance. The last in particular reveals that while document clustering is a crucial pre-processing step, improvements can at most provide for a 3 point improvement in CDEC performance, though this might be attributable to ease of document clustering on ECB+.

pdf bib
Screenplay Quality Assessment: Can We Predict Who Gets Nominated?
Ming-Chang Chiu | Tiantian Feng | Xiang Ren | Shrikanth Narayanan

Deciding which scripts to turn into movies is a costly and time-consuming process for filmmakers. Thus, building a tool to aid script selection, an initial phase in movie production, can be very beneficial. Toward that goal, in this work, we present a method to evaluate the quality of a screenplay based on linguistic cues. We address this in a two-fold approach: (1) we define the task as predicting nominations of scripts at major film awards with the hypothesis that the peer-recognized scripts should have a greater chance to succeed. (2) based on industry opinions and narratology, we extract and integrate domain-specific features into common classification techniques. We face two challenges (1) scripts are much longer than other document datasets (2) nominated scripts are limited and thus difficult to collect. However, with narratology-inspired modeling and domain features, our approach offers clear improvements over strong baselines. Our work provides a new approach for future work in screenplay analysis.

Improving the Identification of the Discourse Function of News Article Paragraphs
Deya Banisakher | W. Victor Yarlott | Mohammed Aldawsari | Naphtali Rishe | Mark Finlayson

Identifying the discourse structure of documents is an important task in understanding written text. Building on prior work, we demonstrate an improved approach to automatically identifying the discourse function of paragraphs in news articles. We start with the hierarchical theory of news discourse developed by van Dijk (1988) which proposes how paragraphs function within news articles. This discourse information is a level intermediate between phrase- or sentence-sized discourse segments and document genre, characterizing how individual paragraphs convey information about the events in the storyline of the article. Specifically, the theory categorizes the relationships between narrated events and (1) the overall storyline (such as Main Events, Background, or Consequences) as well as (2) commentary (such as Verbal Reactions and Evaluations). We trained and tested a linear chain conditional random field (CRF) with new features to model van Dijk’s labels and compared it against several machine learning models presented in previous work. Our model significantly outperformed all baselines and prior approaches, achieving an average of 0.71 F1 score which represents a 31.5% improvement over the previously best-performing support vector machine model.

Systematic Evaluation of a Framework for Unsupervised Emotion Recognition for Narrative Text
Samira Zad | Mark Finlayson

Identifying emotions as expressed in text (a.k.a. text emotion recognition) has received a lot of attention over the past decade. Narratives often involve a great deal of emotional expression, and so emotion recognition on narrative text is of great interest to computational approaches to narrative understanding. Prior work by Kim et al. 2010 was the work with the highest reported emotion detection performance, on a corpus of fairy tales texts. Close inspection of that work, however, revealed significant reproducibility problems, and we were unable to reimplement Kim’s approach as described. As a consequence, we implemented a framework inspired by Kim’s approach, where we carefully evaluated the major design choices. We identify the highest-performing combination, which outperforms Kim’s reported performance by 7.6 F1 points on average. Close inspection of the annotated data revealed numerous missing and incorrect emotion terms in the relevant lexicon, WordNetAffect (WNA; Strapparava and Valitutti, 2004), which allowed us to augment it in a useful way. More generally, this showed that numerous clearly emotive words and phrases are missing from WNA, which suggests that effort invested in augmenting or refining emotion ontologies could be useful for improving the performance of emotion recognition systems. We release our code and data to definitely enable future reproducibility of this work.

Extensively Matching for Few-shot Learning Event Detection
Viet Dac Lai | Thien Huu Nguyen | Franck Dernoncourt

Current event detection models under supervised learning settings fail to transfer to new event types. Few-shot learning has not been explored in event detection even though it allows a model to perform well with high generalization on new event types. In this work, we formulate event detection as a few-shot learning problem to enable to extend event detection to new event types. We propose two novel loss factors that matching examples in the support set to provide more training signals to the model. Moreover, these training signals can be applied in many metric-based few-shot learning models. Our extensive experiments on the ACE-2005 dataset (under a few-shot learning setting) show that the proposed method can improve the performance of few-shot learning.

Exploring the Effect of Author and Reader Identity in Online Story Writing: the STORIESINTHEWILD Corpus.
Tal August | Maarten Sap | Elizabeth Clark | Katharina Reinecke | Noah A. Smith

Current story writing or story editing systems rely on human judgments of story quality for evaluating performance, often ignoring the subjectivity in ratings. We analyze the effect of author and reader characteristics and story writing setup on the quality of stories in a short storytelling task. To study this effect, we create and release STORIESINTHEWILD, containing 1,630 stories collected on a volunteer-based crowdsourcing platform. Each story is rated by three different readers, and comes paired with the author’s and reader’s age, gender, and personality. Our findings show significant effects of authors’ and readers’ identities, as well as writing setup, on story writing and ratings. Notably, compared to younger readers, readers age 45 and older consider stories significantly less creative and less entertaining. Readers also prefer stories written all at once, rather than in chunks, finding them more coherent and creative. We also observe linguistic differences associated with authors’ demographics (e.g., older authors wrote more vivid and emotional stories). Our findings suggest that reader and writer demographics, as well as writing setup, should be accounted for in story writing evaluations.

Script Induction as Association Rule Mining
Anton Belyy | Benjamin Van Durme

We show that the count-based Script Induction models of Chambers and Jurafsky (2008) and Jans et al. (2012) can be unified in a general framework of narrative chain likelihood maximization. We provide efficient algorithms based on Association Rule Mining (ARM) and weighted set cover that can discover interesting patterns in the training data and combine them in a reliable and explainable way to predict the missing event. The proposed method, unlike the prior work, does not assume full conditional independence and makes use of higher-order count statistics. We perform the ablation study and conclude that the inductive biases introduced by ARM are conducive to better performance on the narrative cloze test.

Automatic extraction of personal events from dialogue
Joshua Eisenberg | Michael Sheriff

In this paper we introduce the problem of extracting events from dialogue. Previous work on event extraction focused on newswire, however we are interested in extracting events from spoken dialogue. To ground this study, we annotated dialogue transcripts from fourteen episodes of the podcast This American Life. This corpus contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 represent events. The agreement for this corpus has a Cohen’s Kappa of 0.83. We have open-sourced this corpus for the NLP community. With this corpus in hand, we trained support vector machines (SVM) to correctly classify these phenomena with 0.68 F1, when using episode-fold cross-validation. This is nearly 100% higher F1 than the baseline classifier. The SVM models achieved performance of over 0.75 F1 on some testing folds. We report the results for SVM classifiers trained with four different types of features (verb classes, part of speech tags, named entities, and semantic role labels), and different machine learning protocols (under-sampling and trigram context). This work is grounded in narratology and computational models of narrative. It is useful for extracting events, plot, and story content from spoken dialogue.

Annotating and quantifying narrative time disruptions in modernist and hypertext fiction
Edward Kearns

This paper outlines work in progress on a new method of annotating and quantitatively discussing narrative techniques related to time in fiction. Specifically those techniques are analepsis, prolepsis, narrative level changes, and stream-of-consciousness and free-indirect-discourse narration. By counting the frequency and extent of the usage of these techniques, the narrative characteristics of different works from different time periods and genres can be compared. This project uses modernist fiction and hypertext fiction as its case studies.

Exploring aspects of similarity between spoken personal narratives by disentangling them into narrative clause types
Belen Saldias | Deb Roy

Sharing personal narratives is a fundamental aspect of human social behavior as it helps share our life experiences. We can tell stories and rely on our background to understand their context, similarities, and differences. A substantial effort has been made towards developing storytelling machines or inferring characters’ features. However, we don’t usually find models that compare narratives. This task is remarkably challenging for machines since they, as sometimes we do, lack an understanding of what similarity means. To address this challenge, we first introduce a corpus of real-world spoken personal narratives comprising 10,296 narrative clauses from 594 video transcripts. Second, we ask non-narrative experts to annotate those clauses under Labov’s sociolinguistic model of personal narratives (i.e., action, orientation, and evaluation clause types) and train a classifier that reaches 84.7% F-score for the highest-agreed clauses. Finally, we match stories and explore whether people implicitly rely on Labov’s framework to compare narratives. We show that actions followed by the narrator’s evaluation of these are the aspects non-experts consider the most. Our approach is intended to help inform machine learning methods aimed at studying or representing personal narratives.

Extracting Message Sequence Charts from Hindi Narrative Text
Swapnil Hingmire | Nitin Ramrakhiyani | Avinash Kumar Singh | Sangameshwar Patil | Girish Palshikar | Pushpak Bhattacharyya | Vasudeva Varma

In this paper, we propose the use of Message Sequence Charts (MSC) as a representation for visualizing narrative text in Hindi. An MSC is a formal representation allowing the depiction of actors and interactions among these actors in a scenario, apart from supporting a rich framework for formal inference. We propose an approach to extract MSC actors and interactions from a Hindi narrative. As a part of the approach, we enrich an existing event annotation scheme where we provide guidelines for annotation of the mood of events (realis vs irrealis) and guidelines for annotation of event arguments. We report performance on multiple evaluation criteria by experimenting with Hindi narratives from Indian History. Though Hindi is the fourth most-spoken first language in the world, from the NLP perspective it has comparatively lesser resources than English. Moreover, there is relatively less work in the context of event processing in Hindi. Hence, we believe that this work is among the initial works for Hindi event processing.

Emotion Arcs of Student Narratives
Swapna Somasundaran | Xianyang Chen | Michael Flor

This paper studies emotion arcs in student narratives. We construct emotion arcs based on event affect and implied sentiments, which correspond to plot elements in the story. We show that student narratives can show elements of plot structure in their emotion arcs and that properties of these arcs can be useful indicators of narrative quality. We build a system and perform analysis to show that our arc-based features are complementary to previously studied sentiment features in this area.

Frustratingly Hard Evidence Retrieval for QA Over Books
Xiangyang Mou | Mo Yu | Bingsheng Yao | Chenghao Yang | Xiaoxiao Guo | Saloni Potdar | Hui Su

A lot of progress has been made to improve question answering (QA) in recent years, but the special problem of QA over narrative book stories has not been explored in-depth. We formulate BookQA as an open-domain QA task given its similar dependency on evidence retrieval. We further investigate how state-of-the-art open-domain QA approaches can help BookQA. Besides achieving state-of-the-art on the NarrativeQA benchmark, our study also reveals the difficulty of evidence retrieval in books with a wealth of experiments and analysis - which necessitates future effort on novel solutions for evidence retrieval in BookQA.

On-The-Fly Information Retrieval Augmentation for Language Models
Hai Wang | David McAllester

Here we experiment with the use of information retrieval as an augmentation for pre-trained language models. The text corpus used in information retrieval can be viewed as form of episodic memory which grows over time. By augmenting GPT 2.0 with information retrieval we achieve a zero shot 15% relative reduction in perplexity on Gigaword corpus without any re-training. We also validate our IR augmentation on an event co-reference task.

Detecting and understanding moral biases in news
Usman Shahid | Barbara Di Eugenio | Andrew Rojecki | Elena Zheleva

We describe work in progress on detecting and understanding the moral biases of news sources by combining framing theory with natural language processing. First we draw connections between issue-specific frames and moral frames that apply to all issues. Then we analyze the connection between moral frame presence and news source political leaning. We develop and test a simple classification model for detecting the presence of a moral frame, highlighting the need for more sophisticated models. We also discuss some of the annotation and frame detection challenges that can inform future research in this area.