Ani Nenkova


2023

Factual or Contextual? Disentangling Error Types in Entity Description Generation
Navita Goyal | Ani Nenkova | Hal Daumé III
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In the task of entity description generation, given a context and a specified entity, a model must describe that entity correctly and in a contextually-relevant way. In this task, as well as broader language generation tasks, the generation of a nonfactual description (factual error) versus an incongruous description (contextual error) is fundamentally different, yet often conflated. We develop an evaluation paradigm that enables us to disentangle these two types of errors in naturally occurring textual contexts. We find that factuality and congruity are often at odds, and that models specifically struggle with accurate descriptions of entities that are less familiar to people. This shortcoming of language models raises concerns around the trustworthiness of such models, since factual errors on less well-known entities are exactly those that a human reader will not recognize.

Named Entity Recognition in a Very Homogenous Domain
Oshin Agarwal | Ani Nenkova
Findings of the Association for Computational Linguistics: EACL 2023

Machine learning models have lower accuracy when tested on out-of-domain data. Developing models that perform well on several domains, or can be quickly adapted to a new domain, is an important research area. Domain, however, is a vague term that can refer to any aspect of data, such as language, genre, source and structure. We consider a very homogeneous source of data, specifically sentences from news articles from the same newspaper in English, and collect a dataset of such “in-domain” sentences annotated with named entities. We find that even in such a homogeneous domain, the performance of named entity recognition models varies significantly across news topics. Selection of diverse data, as we demonstrate, is crucial even in a seemingly homogeneous domain.

2022

“Am I Answering My Job Interview Questions Right?”: A NLP Approach to Predict Degree of Explanation in Job Interview Responses
Raghu Verrap | Ehsanul Nirjhar | Ani Nenkova | Theodora Chaspari
Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)

Providing the right amount of explanation in an employment interview can help the interviewee effectively communicate their skills and experience to the interviewer and convince them that they are the right candidate for the job. This paper examines natural language processing (NLP) approaches, including word-based tokenization, lexicon-based representations, and pre-trained embeddings with deep learning models, for detecting the degree of explanation in a job interview response. These are exemplified in a study of 24 military veterans, the focal group of this study, since they can experience unique challenges in job interviews due to the verbal communication style prevalent in the military. The veterans conducted mock interviews with industry recruiters, and data from these interviews were transcribed and analyzed. Results indicate the feasibility of automated NLP methods for detecting the degree of explanation in an interview response. Features based on tokenizer analysis are the most effective in detecting under-explained responses (0.29 F1-score), while lexicon-based methods perform best in detecting over-explanation (0.51 F1-score). Findings from this work lay the foundation for the design of intelligent assistive technologies that can provide personalized learning pathways to job candidates, especially those belonging to sensitive or under-represented populations, and help them succeed in job interviews, ultimately contributing to an inclusive workforce.

DocTime: A Document-level Temporal Dependency Graph Parser
Puneet Mathur | Vlad Morariu | Verena Kaynig-Fittkau | Jiuxiang Gu | Franck Dernoncourt | Quan Tran | Ani Nenkova | Dinesh Manocha | Rajiv Jain
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We introduce DocTime, a novel temporal dependency graph (TDG) parser that takes a text document as input and produces a temporal dependency graph. It outperforms previous BERT-based solutions by a relative 4-8% on three datasets by modeling the problem as a graph network with a path-prediction loss that incorporates longer-range dependencies. This work also demonstrates how the TDG can be used to improve the downstream tasks of temporal question answering and NLI by a relative 4-10%, with a new framework that incorporates the temporal dependency graph into the self-attention layer of Transformer models (Time-transformer). Finally, we develop and evaluate on a new temporal dependency graph dataset for the domain of contractual documents, which has not been previously explored in this setting.

Proceedings of the Second Workshop on Bridging Human–Computer Interaction and Natural Language Processing
Su Lin Blodgett | Hal Daumé III | Michael Madaio | Ani Nenkova | Brendan O'Connor | Hanna Wallach | Qian Yang
Proceedings of the Second Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns
Zihan Wang | Jiuxiang Gu | Jason Kuen | Handong Zhao | Vlad Morariu | Ruiyi Zhang | Ani Nenkova | Tong Sun | Jingbo Shang
Findings of the Association for Computational Linguistics: ACL 2022

We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning-only approach yields a slightly worse but still competitive model. Then we compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns—during fine-tuning—different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the adaptive axis attention method identifies important tokens—for each task and model layer—and focuses attention on those. It does not require pre-training to accommodate the sparse patterns and demonstrates competitive, and sometimes better, performance compared with fixed sparse attention patterns that require resource-intensive pre-training.
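
As background for the comparison above, the fixed local and global patterns can be written as boolean attention masks. A toy sketch under our own assumptions (the window size and global positions are arbitrary illustrations; this shows the fixed-pattern baselines, not the paper's Adaptive Axis Attention):

```python
import numpy as np

def local_mask(n, window=2):
    """Local pattern: each token attends only to neighbors within a window."""
    i, j = np.indices((n, n))
    return np.abs(i - j) <= window

def global_mask(n, global_positions=(0,)):
    """Global pattern: designated tokens attend to, and are attended by, all."""
    mask = np.zeros((n, n), dtype=bool)
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# A fixed combination of both patterns, of the kind the paper benchmarks against.
print(local_mask(6, window=1) | global_mask(6))
```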

Influence Functions for Sequence Tagging Models
Sarthak Jain | Varun Manjunatha | Byron Wallace | Ani Nenkova
Findings of the Association for Computational Linguistics: EMNLP 2022

Many standard tasks in NLP (e.g., Named Entity Recognition, Part-of-Speech tagging, and Semantic Role Labeling) are naturally framed as sequence tagging problems. However, there has been comparatively little work on interpretability methods for sequence tagging models. In this paper, we extend influence functions — which aim to trace predictions back to the training points that informed them — to sequence tagging tasks. We define the influence of a training instance segment as the effect that perturbing the labels within this segment has on a test segment level prediction. We provide an efficient approximation to compute this, and show that it tracks with the “true” segment influence (measured empirically). We show the practical utility of segment influence by using the method to identify noisy annotations in NER corpora.
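
For readers unfamiliar with influence functions, the first-order approximation of Koh and Liang (2017) that this line of work builds on has the form below; restricting the losses to token segments is our paraphrase of the definition in the abstract, and the notation is illustrative rather than the paper's own:

```latex
% Influence of a training segment s_tr on a test segment s_te (sketch):
\mathcal{I}(s_{\mathrm{tr}}, s_{\mathrm{te}}) \approx
  -\,\nabla_{\theta} L(s_{\mathrm{te}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta} L(s_{\mathrm{tr}}, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})
```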

Context-aware Information-theoretic Causal De-biasing for Interactive Sequence Labeling
Junda Wu | Rui Wang | Tong Yu | Ruiyi Zhang | Handong Zhao | Shuai Li | Ricardo Henao | Ani Nenkova
Findings of the Association for Computational Linguistics: EMNLP 2022

Supervised training of existing deep learning models for sequence labeling relies on large-scale labeled datasets. Such datasets are generally created with crowdsourced labeling. However, crowdsourced labeling for sequence labeling tasks can be expensive and time-consuming. Further, labeling by external annotators may not be appropriate for data that contains private user information. Considering these limitations of crowdsourced labeling, we study interactive sequence labeling, which allows training directly with user feedback, alleviating the annotation cost and maintaining user privacy. We identify two biases, namely context bias and feedback bias, by formulating interactive sequence labeling via a Structural Causal Model (SCM). To alleviate the context and feedback biases based on the SCM, we identify frequent context tokens as confounders in the backdoor adjustment and further propose an entropy-based modulation, inspired by information theory, that lets the model learn entities more sample-efficiently. With extensive experiments, we validate that our approach effectively alleviates the biases and that our models can be learned efficiently from user feedback.

Temporal Effects on Pre-trained Models for Language Processing Tasks
Oshin Agarwal | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 10

Keeping the performance of language technologies optimal as time passes is of great practical interest. We study temporal effects on model performance on downstream language tasks, establishing a nuanced terminology for such discussion and identifying factors essential to conduct a robust study. We present experiments for several tasks in English where the label correctness is not dependent on time and demonstrate the importance of distinguishing between temporal model deterioration and temporal domain adaptation for systems using pre-trained representations. We find that, depending on the task, temporal model deterioration is not necessarily a concern. Temporal domain adaptation, however, is beneficial in all cases, with better performance for a given time period possible when the system is trained on temporally more recent data. Therefore, we also examine the efficacy of two approaches for temporal domain adaptation without human annotations on new data. Self-labeling shows consistent improvement and notably, for named entity recognition, leads to better temporal adaptation than even human annotations.
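
A minimal sketch of the self-labeling strategy evaluated above, assuming a generic model interface (model.predict and retrain are illustrative names, not the paper's code):

```python
def temporal_self_labeling(model, unlabeled_by_period, retrain):
    """For each new time period, tag the period's unlabeled text with the
    current model, then retrain on those silver labels before moving on;
    no human annotations on new data are involved."""
    for period in sorted(unlabeled_by_period):
        silver = [(text, model.predict(text))
                  for text in unlabeled_by_period[period]]
        model = retrain(model, silver)
    return model
```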

MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding
Zilong Wang | Jiuxiang Gu | Chris Tensmeyer | Nikolaos Barmpalios | Ani Nenkova | Tong Sun | Jingbo Shang | Vlad Morariu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words), through medium granularity (e.g., regions such as paragraphs or figures), to coarse granularity (e.g., the whole page). The spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks. Existing methods learn features at either the word level or the region level, but fail to consider both simultaneously. Word-level models are restricted by the fact that they originate from pure-text language models, which only encode the word-level context. In contrast, region-level models attempt to encode regions corresponding to paragraphs or text blocks into a single embedding, but they perform worse with additional word-level features. To deal with these issues, we propose MGDoc, a new multi-modal multi-granular pre-training framework that encodes page-level, region-level, and word-level information at the same time. MGDoc uses a unified text-visual encoder to obtain multi-modal features across different granularities, which makes it possible to project the multi-granular features into the same hyperspace. To model the region-word correlation, we design a cross-granular attention mechanism and specific pre-training tasks that reinforce the model's learning of the hierarchy between regions and words. Experiments demonstrate that our proposed model learns better features that perform well across granularities and lead to improvements in downstream tasks.

Self-Repetition in Abstractive Neural Summarizers
Nikita Salkar | Thomas Trikalinos | Byron Wallace | Ani Nenkova
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language is associated with a higher rate of self-repetition. In qualitative analysis, we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
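
The repetition measure is compact enough to sketch directly. A rough implementation, assuming whitespace tokenization and, for brevity, counting only n = 4 rather than all n >= 4 (our sketch, not the authors' code):

```python
from collections import Counter
from itertools import chain

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def self_repetition(summaries, n=4):
    """Count n-grams that appear in more than one output of the same system."""
    per_summary = [set(ngrams(s.split(), n)) for s in summaries]
    counts = Counter(chain.from_iterable(per_summary))
    return sum(1 for c in counts.values() if c > 1)

# Toy usage: two outputs share the formulaic 4-gram "for more information visit".
outputs = ["call us for more information visit our site",
           "for more information visit the help desk"]
print(self_repetition(outputs))  # -> 1
```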

2021

From Toxicity in Online Comments to Incivility in American News: Proceed with Caution
Anushree Hede | Oshin Agarwal | Linda Lu | Diana C. Mutz | Ani Nenkova
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

The ability to quantify incivility online, in news and in congressional debates, is of great interest to political scientists. Computational tools for detecting online incivility in English are now fairly accessible and could potentially be applied more broadly. We test the Jigsaw Perspective API for its ability to detect the degree of incivility on a corpus we developed, consisting of manual annotations of civility in American news. We demonstrate that toxicity models, as exemplified by Perspective, are inadequate for the analysis of incivility in news. We carry out an error analysis that points to the need for methods that remove spurious correlations between incivility and words often mentioned in the news, especially identity descriptors. Without such improvements, applying Perspective or similar models to news is likely to lead to wrong conclusions that are not aligned with human perception of incivility.

The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English
Oshin Agarwal | Ani Nenkova
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve
Oshin Agarwal | Yinfei Yang | Byron C. Wallace | Ani Nenkova
Computational Linguistics, Volume 47, Issue 1 - March 2021

Named entity recognition systems achieve remarkable performance on domains such as English news. It is natural to ask: What are these models actually learning to achieve this? Are they merely memorizing the names themselves? Or are they capable of interpreting the text and inferring the correct entity type from the linguistic context? We examine these questions by contrasting the performance of several variants of architectures for named entity recognition, some provided only with representations of the context as features. We experiment with GloVe-based BiLSTM-CRF as well as BERT. We find that context does influence predictions, but the main factor driving high performance is learning the named tokens themselves. Furthermore, we find that BERT is not always better at recognizing predictive contexts than a BiLSTM-CRF model. We enlist human annotators to evaluate the feasibility of inferring entity types from context alone and find that humans, too, are unable to infer entity types for the majority of examples on which the context-only system made errors. However, there is room for improvement: a system should be able to recognize any named entity in a predictive context correctly, and our experiments indicate that current systems could be improved by adding such a capability. Our human study also revealed that systems and humans do not always learn the same contextual clues: context-only systems are sometimes correct even when humans fail to recognize the entity type from the context. Finally, we find that one issue contributing to model errors is the use of “entangled” representations that encode both contextual and local token information into a single vector, which can obscure clues. Our results suggest that designing models that explicitly operate over representations of local inputs and context, respectively, may in some cases improve performance. In light of these and related findings, we highlight directions for future work.
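
One way to realize the context-only condition contrasted above is to hide the entity tokens themselves before tagging. A minimal illustration (the [MASK] placeholder scheme is our assumption, not necessarily the article's exact setup):

```python
def context_only(tokens, gold_tags, mask_token="[MASK]"):
    """Replace every token inside a gold entity span with a placeholder,
    so a tagger can rely only on the surrounding context."""
    return [mask_token if tag != "O" else tok
            for tok, tag in zip(tokens, gold_tags)]

tokens = ["Yesterday", "Nenkova", "visited", "Philadelphia", "."]
tags   = ["O", "B-PER", "O", "B-LOC", "O"]
print(context_only(tokens, tags))
# -> ['Yesterday', '[MASK]', 'visited', '[MASK]', '.']
```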

2020

Trialstreamer: Mapping and Browsing Medical Evidence in Real-Time
Benjamin Nye | Ani Nenkova | Iain Marshall | Byron C. Wallace
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce Trialstreamer, a living database of clinical trial reports. Here we mainly describe the evidence extraction component; this extracts from biomedical abstracts key pieces of information that clinicians need when appraising the literature, and also the relations between these. Specifically, the system extracts descriptions of trial participants, the treatments compared in each arm (the interventions), and which outcomes were measured. The system then attempts to infer which interventions were reported to work best by determining their relationship with identified trial outcome measures. In addition to summarizing individual trials, these extracted data elements allow automatic synthesis of results across many trials on the same topic. We apply the system at scale to all reports of randomized controlled trials indexed in MEDLINE, powering the automatic generation of evidence maps, which provide a global view of the efficacy of different interventions combining data from all relevant clinical trials on a topic. We make all code and models freely available alongside a demonstration of the web interface.

Transactions of the Association for Computational Linguistics, Volume 8
Mark Johnson | Brian Roark | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 8

2019

The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization
Simeng Sun | Ani Nenkova
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

ROUGE is widely used to automatically evaluate summarization systems. However, ROUGE measures semantic overlap between a system summary and a human reference at the word-string level, which is much at odds with contemporary treatments of semantic meaning. Here we present a suite of experiments on using distributed representations for evaluating summarizers, in both reference-based and reference-free settings. Our experimental results show that taking the max value over each dimension of the summary's ELMo word embeddings yields a representation that correlates highly with human ratings. Averaging the cosine similarity across all encoders we tested yields high correlation with manual scores in the reference-free setting. The distributed representations outperform ROUGE on recent corpora for abstractive news summarization but fare less well on the test data used in past evaluations.
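
A minimal sketch of the max-pooled representation and the reference-based cosine scoring described above; random arrays stand in for actual ELMo token embeddings:

```python
import numpy as np

def summary_vector(token_embeddings):
    """Max over each embedding dimension across the summary's tokens."""
    return np.max(token_embeddings, axis=0)

def embedding_score(system_embs, reference_embs):
    """Cosine similarity between the max-pooled summary vectors."""
    s, r = summary_vector(system_embs), summary_vector(reference_embs)
    return float(np.dot(s, r) / (np.linalg.norm(s) * np.linalg.norm(r)))

# Toy usage with random stand-ins for (num_tokens x dim) embedding matrices.
rng = np.random.default_rng(0)
print(embedding_score(rng.normal(size=(12, 1024)),
                      rng.normal(size=(15, 1024))))
```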

Transactions of the Association for Computational Linguistics, Volume 7
Lillian Lee | Mark Johnson | Brian Roark | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 7

Predicting Annotation Difficulty to Improve Task Routing and Model Performance for Biomedical Information Extraction
Yinfei Yang | Oshin Agarwal | Chris Tar | Byron C. Wallace | Ani Nenkova
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Modern NLP systems require high-quality annotated data. For specialized domains, expert annotations may be prohibitively expensive; the alternative is to rely on crowdsourcing to reduce costs at the risk of introducing noise. In this paper we demonstrate that directly modeling instance difficulty can be used to improve model performance and to route instances to appropriate annotators. Our difficulty prediction model combines two learned representations: a ‘universal’ encoder trained on out-of-domain data, and a task-specific encoder. Experiments on a complex biomedical information extraction task using expert and lay annotators show that: (i) simply excluding from the training data instances predicted to be difficult yields a small boost in performance; (ii) using difficulty scores to weight instances during training provides further, consistent gains; (iii) assigning instances predicted to be difficult to domain experts is an effective strategy for task routing. Further, our experiments confirm the expectation that for such domain-specific tasks expert annotations are of much higher quality and preferable to obtain if practical, and that augmenting small amounts of expert data with a larger set of lay annotations leads to further improvements in model performance.
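
Strategy (ii) above amounts to weighting each instance's loss by its predicted difficulty. A toy sketch (the inverse-difficulty weighting is one plausible scheme we assume for illustration, not necessarily the paper's exact formulation):

```python
def difficulty_weighted_loss(instance_losses, difficulty_scores):
    """Down-weight instances predicted to be difficult (scores in [0, 1])."""
    weighted = [loss * (1.0 - d)
                for loss, d in zip(instance_losses, difficulty_scores)]
    return sum(weighted) / len(weighted)

# Toy usage: the third, very difficult instance contributes little.
print(difficulty_weighted_loss([0.6, 0.4, 0.9], [0.1, 0.2, 0.95]))
```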

Emotion Impacts Speech Recognition Performance
Rushab Munot | Ani Nenkova
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

It has been established that the performance of speech recognition systems depends on multiple factors, including the lexical content, speaker identity and dialect. Here we use three English datasets of acted emotion to demonstrate that emotional content also impacts the performance of commercial systems. On two of the corpora, emotion is a bigger contributor to recognition errors than speaker identity, and on two, neutral speech is recognized considerably better than emotional speech. We further evaluate the commercial systems on spontaneous interactions that contain portions of emotional speech. We propose, and validate on the acted datasets, a method that allows us to evaluate the overall impact of emotion on recognition even when manual transcripts are not available. Using this method, we show that emotion in natural spontaneous dialogue is a less prominent but still significant factor in recognition accuracy.

Word Embeddings (Also) Encode Human Personality Stereotypes
Oshin Agarwal | Funda Durupınar | Norman I. Badler | Ani Nenkova
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Word representations trained on text reproduce human implicit bias related to gender, race and age. Methods have been developed to remove such bias. Here, we present results showing that human stereotypes exist even for much more nuanced judgments such as personality, for a variety of person identities beyond the typically legally protected attributes, and that these are similarly captured in word representations. Specifically, we collected human judgments about a person’s Big Five personality traits formed solely from information about the occupation, nationality or a common noun description of a hypothetical person. Analysis of the data reveals a large number of statistically significant stereotypes in people. We then demonstrate that the bias captured in lexical representations is statistically significantly correlated with the documented human bias. Our results, showing bias for a large set of person descriptors on such nuanced traits, put in doubt the feasibility of broadly and fairly applying debiasing methods, and call for the development of new methods for auditing language technology systems and resources.

How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature
Simeng Sun | Ori Shapira | Ido Dagan | Ani Nenkova
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation

We show that plain ROUGE F1 scores are not ideal for comparing current neural systems, which on average produce summaries of different lengths. This is due to a non-linear relation between ROUGE F1 and summary length. To alleviate the effect of length during evaluation, we propose a new method that normalizes the ROUGE F1 score of a system by that of a random system with the same average output length. A pilot human evaluation showed that humans prefer shorter summaries in terms of verbosity, but overall consider longer summaries to be of higher quality. While human evaluations are more expensive in time and resources, it is clear that normalization, such as the one we propose for automatic evaluation, will make human evaluations more meaningful.
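
The proposed normalization is easy to state in code. A rough sketch, where rouge_f1 stands for any ROUGE F1 implementation; the greedy word-level length matching and single random draw are simplifications of ours, not the paper's exact protocol:

```python
import random

def random_summary(input_sentences, target_len_words, seed=0):
    """Baseline: shuffle the input sentences and take them until the
    output reaches the target length in words."""
    rng = random.Random(seed)
    pool = list(input_sentences)
    rng.shuffle(pool)
    out, n = [], 0
    for sent in pool:
        if n >= target_len_words:
            break
        out.append(sent)
        n += len(sent.split())
    return " ".join(out)

def normalized_rouge(rouge_f1, system_summary, input_sentences, reference):
    """Normalize a system's ROUGE F1 by that of a random system whose
    output matches the system summary's length."""
    target = len(system_summary.split())
    baseline = random_summary(input_sentences, target)
    return rouge_f1(system_summary, reference) / rouge_f1(baseline, reference)
```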

Browsing Health: Information Extraction to Support New Interfaces for Accessing Medical Evidence
Soham Parikh | Elizabeth Conrad | Oshin Agarwal | Iain Marshall | Byron Wallace | Ani Nenkova
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Standard paradigms for search do not work well in the medical context. Typical information needs, such as retrieving a full list of medical interventions for a given condition, or finding the reported efficacy of a particular treatment with respect to a specific outcome of interest, cannot be straightforwardly posed in typical text-box search. Instead, we propose faceted search in which a user specifies a condition and can then browse treatments and outcomes that have been evaluated. Choosing from these, they can access randomized control trials (RCTs) describing individual studies. Realizing such a view of the medical evidence requires information extraction techniques to identify the population, interventions, and outcome measures in an RCT. Patients, health practitioners, and biomedical librarians all stand to benefit from such innovation in search of medical evidence. We present an initial prototype of such an interface applied to pre-registered clinical studies. We also discuss pilot studies into the applicability of information extraction methods to allow for similar access to all published trial results.

Evaluation of named entity coreference
Oshin Agarwal | Sanjay Subramanian | Ani Nenkova | Dan Roth
Proceedings of the Second Workshop on Computational Models of Reference, Anaphora and Coreference

In many NLP applications, like search and information extraction for named entities, it is necessary to find all the mentions of a named entity, some of which appear as pronouns (she, his, etc.) or nominals (the professor, the German chancellor, etc.). It is therefore important that coreference resolution systems be able to link these different types of mentions to the correct entity name. We evaluate state-of-the-art coreference resolution systems on the task of resolving all mentions to named entities. Our analysis reveals that standard coreference metrics do not adequately reflect the requirements of this task: they do not penalize a system for failing to identify any mention of an entity by name, and they reward a system even if it correctly finds mentions of the same entity but fails to link them to a proper name (she–the student–no name). We introduce new metrics for evaluating named entity coreference that address these discrepancies, and show that for comparisons of competitive systems, standard coreference evaluations could give misleading results on this task. We are, however, able to confirm that the state-of-the-art system according to traditional evaluations also performs vastly better than other systems on the named entity coreference task.

2018

A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature
Benjamin Nye | Junyi Jessy Li | Roma Patel | Yinfei Yang | Iain Marshall | Ani Nenkova | Byron Wallace
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a corpus of 5,000 richly annotated abstracts of medical articles describing clinical randomized controlled trials. Annotations include demarcations of text spans that describe the Patient population enrolled, the Interventions studied and to what they were Compared, and the Outcomes measured (the ‘PICO’ elements). These spans are further annotated at a more granular level, e.g., individual interventions within them are marked and mapped onto a structured medical vocabulary. We acquired annotations from a diverse set of workers with varying levels of expertise and cost. We describe our data collection process and the corpus itself in detail. We then outline a set of challenging NLP tasks that would aid searching of the medical literature and the practice of evidence-based medicine.

Syntactic Patterns Improve Information Extraction for Medical Search
Roma Patel | Yinfei Yang | Iain Marshall | Ani Nenkova | Byron Wallace
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Medical professionals search the published literature by specifying the type of patients, the medical intervention(s) and the outcome measure(s) of interest. In this paper we demonstrate how features encoding syntactic patterns improve the performance of state-of-the-art sequence tagging models (both neural and linear) for information extraction of these medically relevant categories. We present an analysis of the type of patterns exploited and of the semantic space induced for these, i.e., the distributed representations learned for identified multi-token patterns. We show that these learned representations differ substantially from those of the constituent unigrams, suggesting that the patterns capture contextual information that is otherwise lost.

Evaluating Multiple System Summary Lengths: A Case Study
Ori Shapira | David Gabay | Hadar Ronen | Judit Bar-Ilan | Yael Amsterdamer | Ani Nenkova | Ido Dagan
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Practical summarization systems are expected to produce summaries of varying lengths, per user needs. While a couple of early summarization benchmarks tested systems across multiple summary lengths, this practice was mostly abandoned due to the assumed cost of producing reference summaries of multiple lengths. In this paper, we raise the research question of whether reference summaries of a single length can be used to reliably evaluate system summaries of multiple lengths. For that, we have analyzed a couple of datasets as a case study, using several variants of the ROUGE metric that are standard in summarization evaluation. Our findings indicate that the evaluation protocol in question is indeed competitive. This result paves the way to practically evaluating varying-length summaries with simple, possibly existing, summarization benchmarks.

2017

Aggregating and Predicting Sequence Labels from Crowd Annotations
An Thanh Nguyen | Byron Wallace | Junyi Jessy Li | Ani Nenkova | Matthew Lease
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite sequences being core to NLP, scant work has considered how to handle noisy sequence labels from multiple annotators for the same text. Given such annotations, we consider two complementary tasks: (1) aggregating sequential crowd labels to infer a best single set of consensus annotations; and (2) using crowd annotations as training data for a model that can predict sequences in unannotated text. For aggregation, we propose a novel Hidden Markov Model variant. To predict sequences in unannotated text, we propose a neural approach using Long Short Term Memory. We evaluate a suite of methods across two different applications and text genres: Named-Entity Recognition in news articles and Information Extraction from biomedical abstracts. Results show improvement over strong baselines. Our source code and data are available online.
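
As a point of reference for the aggregation task above, the standard per-token majority-vote baseline is easy to write down (the paper's own aggregator is an HMM variant, which this sketch does not reproduce):

```python
from collections import Counter

def token_majority_vote(annotations):
    """Aggregate sequence labels from several annotators by per-token
    majority vote; `annotations` is a list of equal-length tag sequences."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*annotations)]

ann1 = ["B-PER", "I-PER", "O", "O"]
ann2 = ["B-PER", "O",     "O", "B-LOC"]
ann3 = ["B-PER", "I-PER", "O", "B-LOC"]
print(token_majority_vote([ann1, ann2, ann3]))
# -> ['B-PER', 'I-PER', 'O', 'B-LOC']
```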

Detecting (Un)Important Content for Single-Document News Summarization
Yinfei Yang | Forrest Bao | Ani Nenkova
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a robust approach for detecting intrinsic sentence importance in news, by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the “beginning of document” heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because in the absence of cross-document repetition, single document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline.

2016

Improving the Annotation of Sentence Specificity
Junyi Jessy Li | Bridget O’Daniel | Yi Wu | Wenli Zhao | Ani Nenkova
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce improved guidelines for annotation of sentence specificity, addressing the issues encountered in prior work. Our annotation provides judgements of sentences in context. Rather than binary judgements, we introduce a specificity scale that accommodates nuanced judgements. Our augmented annotation procedure also allows us to define where in the discourse context the lack of specificity can be resolved. In addition, the cause of the underspecification is annotated in the form of free-text questions. We present results from a pilot annotation with this new scheme and demonstrate good inter-annotator agreement. We found that the lack of specificity distributes evenly among immediate prior context, long-distance prior context and no prior context. Missing details that are not resolved in the prior context are more likely to trigger questions about the reason behind events, “why” and “how”. Our data is accessible at http://www.cis.upenn.edu/~nlp/corpora/lrec16spec.html

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Kevin Knight | Ani Nenkova | Owen Rambow
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The Instantiation Discourse Relation: A Corpus Analysis of Its Properties and Improved Detection
Junyi Jessy Li | Ani Nenkova
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Phrase Generalization: a Corpus Study in Multi-Document Abstracts and Original News Alignments
Ariani Di-Felippo | Ani Nenkova
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

2015

Inducing Lexical Style Properties for Paraphrase and Genre Differentiation
Ellie Pavlick | Ani Nenkova
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Identification and Characterization of Newsworthy Verbs in World News
Benjamin Nye | Ani Nenkova
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

System Combination for Multi-document Summarization
Kai Hong | Mitchell Marcus | Ani Nenkova
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Detecting Content-Heavy Sentences: A Cross-Language Case Study
Junyi Jessy Li | Ani Nenkova
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under Length Constraints
Annie Louis | Ani Nenkova
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Improving the Estimation of Word Importance for News Multi-Document Summarization
Kai Hong | Ani Nenkova
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)
Sandra Williams | Advaith Siddharthan | Ani Nenkova
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

Proceedings of the 5th Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Dimitra Anastasiou | Cui Jian | Ani Nenkova | Rupal Patel | Frank Rudzicz | Annalu Waller | Desislava Zhekova
Proceedings of the 5th Workshop on Speech and Language Processing for Assistive Technologies

Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)
Kallirroi Georgila | Matthew Stone | Helen Hastie | Ani Nenkova
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

Addressing Class Imbalance for Improved Recognition of Implicit Discourse Relations
Junyi Jessy Li | Ani Nenkova
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

Reducing Sparsity Improves the Recognition of Implicit Discourse Relations
Junyi Jessy Li | Ani Nenkova
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

A Repository of State of the Art and Competitive Baseline Summaries for Generic News Summarization
Kai Hong | John Conroy | Benoit Favre | Alex Kulesza | Hui Lin | Ani Nenkova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In the period since 2004, many novel sophisticated approaches for generic multi-document summarization have been developed. Intuitive simple approaches have also been shown to perform unexpectedly well for the task. Yet it is practically impossible to compare the existing approaches directly, because systems have been evaluated on different datasets, with different evaluation measures, against different sets of comparison systems. Here we present a corpus of summaries produced by several state-of-the-art extractive summarization systems or by popular baseline systems. The inputs come from the 2004 DUC evaluation, the latest year in which generic summarization was addressed in a shared task. We use the same settings for ROUGE automatic evaluation to compare the systems directly and analyze the statistical significance of the differences in performance. We show that in terms of average scores the state-of-the-art systems appear similar but that in fact they produce very different summaries. Our corpus will facilitate future research on generic summarization and motivates the need for development of more sensitive evaluation measures and for approaches to system combination in summarization.

Assessing the Discourse Factors that Influence the Quality of Machine Translation
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Cross-lingual Discourse Relation Analysis: A corpus study and a semi-supervised classification system
Junyi Jessy Li | Marine Carpuat | Ani Nenkova
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

What Makes Writing Great? First Experiments on Article Quality Prediction in the Science Journalism Domain
Annie Louis | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 1

Great writing is rare and highly admired. Readers seek out articles that are beautifully written, informative and entertaining. Yet information-access technologies lack capabilities for predicting article quality at this level. In this paper we present first experiments on article quality prediction in the science journalism domain. We introduce a corpus of great pieces of science journalism, along with typical articles from the genre. We implement features to capture aspects of great writing, including surprising, visual and emotional content, as well as general features related to discourse organization and sentence structure. We show that the distinction between great and typical articles can be detected fairly accurately, and that the entire spectrum of our features contributes to the distinction.

Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations
Sandra Williams | Advaith Siddharthan | Ani Nenkova
Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations

Automatically Assessing Machine Summary Content Without a Gold Standard
Annie Louis | Ani Nenkova
Computational Linguistics, Volume 39, Issue 2 - June 2013

A Decade of Automatic Content Evaluation of News Summaries: Reassessing the State of the Art
Peter A. Rankel | John M. Conroy | Hoa Trang Dang | Ani Nenkova
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

A corpus of general and specific sentences from news
Annie Louis | Ani Nenkova
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a corpus of sentences from news articles that are annotated as general or specific. We employed annotators on Amazon Mechanical Turk to mark sentences from three kinds of news articles: reports on events, finance news and science journalism. We introduce the resulting corpus, with focus on annotator agreement, the proportion of general and specific sentences in the articles, and results for automatic classification of the two sentence types.

Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls
Kai Hong | Christian G. Kohler | Mary E. March | Amber A. Parker | Ani Nenkova
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

A Coherence Model Based on Syntactic Patterns
Annie Louis | Ani Nenkova
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Acoustic-Prosodic Entrainment and Social Behavior
Rivka Levitan | Agustín Gravano | Laura Willson | Štefan Beňuš | Julia Hirschberg | Ani Nenkova
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the NAACL HLT 2012 Student Research Workshop
Rivka Levitan | Myle Ott | Roger Levy | Ani Nenkova
Proceedings of the NAACL HLT 2012 Student Research Workshop

Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations
Sandra Williams | Advaith Siddharthan | Ani Nenkova
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization
John M. Conroy | Hoa Trang Dang | Ani Nenkova | Karolina Owczarzak
Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization

An Assessment of the Accuracy of Automatic Evaluation in Summarization
Karolina Owczarzak | John M. Conroy | Hoa Trang Dang | Ani Nenkova
Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization

2011

Automatic identification of general and specific sentences by leveraging discourse annotations
Annie Louis | Ani Nenkova
Proceedings of 5th International Joint Conference on Natural Language Processing

Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages
Ani Nenkova | Julia Hirschberg | Yang Liu
Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages

Text Specificity and Impact on Quality of News Summaries
Annie Louis | Ani Nenkova
Proceedings of the Workshop on Monolingual Text-To-Text Generation

Information Status Distinctions and Referring Expressions: An Empirical Study of References to People in News Summaries
Advaith Siddharthan | Ani Nenkova | Kathleen McKeown
Computational Linguistics, Volume 37, Issue 4 - December 2011

Automatic Summarization
Ani Nenkova | Sameer Maskey | Yang Liu
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

2010

Creating Local Coherence: An Empirical Assessment
Annie Louis | Ani Nenkova
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Emily Pitler | Annie Louis | Ani Nenkova
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Using entity features to classify implicit discourse relations
Annie Louis | Aravind Joshi | Rashmi Prasad | Ani Nenkova
Proceedings of the SIGDIAL 2010 Conference

Discourse indicators for content selection in summarization
Annie Louis | Aravind Joshi | Ani Nenkova
Proceedings of the SIGDIAL 2010 Conference

2009

Automatic sense prediction for implicit discourse relations in text
Emily Pitler | Annie Louis | Ani Nenkova
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Using Syntax to Disambiguate Explicit Discourse Connectives in Text
Emily Pitler | Ani Nenkova
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Automatically Evaluating Content Selection in Summarization without Human Models
Annie Louis | Ani Nenkova
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Predicting the Fluency of Text with Shallow Structural Features: Case Studies of Machine Translation and Human-Written Text
Jieun Chae | Ani Nenkova
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Performance Confidence Estimation for Automatic Summarization
Annie Louis | Ani Nenkova
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Easily Identifiable Discourse Relations
Emily Pitler | Mridhula Raghupathy | Hena Mehta | Ani Nenkova | Alan Lee | Aravind Joshi
Coling 2008: Companion volume: Posters

Entity-driven Rewrite for Multi-document Summarization
Ani Nenkova
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization
Ani Nenkova | Annie Louis
Proceedings of ACL-08: HLT

High Frequency Word Entrainment in Spoken Dialogue
Ani Nenkova | Agustín Gravano | Julia Hirschberg
Proceedings of ACL-08: HLT, Short Papers

Tutorial Abstracts of ACL-08: HLT
Ani Nenkova | Marilyn Walker | Eugene Agichtein
Tutorial Abstracts of ACL-08: HLT

Revisiting Readability: A Unified Framework for Predicting Text Quality
Emily Pitler | Ani Nenkova
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Measuring Importance and Query Relevance in Topic-focused Multi-document Summarization
Surabhi Gupta | Ani Nenkova | Dan Jurafsky
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

To Memorize or to Predict: Prominence labeling in Conversational Speech
Ani Nenkova | Jason Brenier | Anubha Kothari | Sasha Calhoun | Laura Whitton | David Beaver | Dan Jurafsky
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

2005

Automatically Learning Cognitive Status for Multi-Document Summarization of Newswire
Ani Nenkova | Advaith Siddharthan | Kathleen McKeown
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

Syntactic Simplification for Improving Content Selection in Multi-Document Summarization
Advaith Siddharthan | Ani Nenkova | Kathleen McKeown
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Evaluating Content Selection in Summarization: The Pyramid Method
Ani Nenkova | Rebecca Passonneau
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

2003

References to Named Entities: a Corpus Study
Ani Nenkova | Kathleen McKeown
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers

Columbia’s Newsblaster: New Features and Future Directions
Kathleen McKeown | Regina Barzilay | John Chen | David Elson | David Evans | Judith Klavans | Ani Nenkova | Barry Schiffman | Sergey Sigelman
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations
