Jana Diesner


What changed? Investigating Debiasing Methods using Causal Mediation Analysis
Sullam Jeoung | Jana Diesner
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Previous work has examined how debiasing language models affects downstream tasks, specifically, how debiasing techniques influence task performance and whether debiased models also make impartial predictions in downstream tasks. However, what we do not yet understand well is why debiasing methods have varying impacts on downstream tasks and how debiasing techniques affect internal components of language models, i.e., neurons, layers, and attention heads. In this paper, we decompose the internal mechanisms of debiasing language models with respect to gender by applying causal mediation analysis to understand the influence of debiasing methods on toxicity detection as a downstream task. Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics, and to focus on changes in the behavior of certain components of the models, e.g., the first two layers of language models and attention heads.


BACO: A Background Knowledge- and Content-Based Framework for Citing Sentence Generation
Yubin Ge | Ly Dinh | Xiaofeng Liu | Jinsong Su | Ziyao Lu | Ante Wang | Jana Diesner
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper, we focus on the problem of citing sentence generation, which entails generating a short text to capture the salient information in a cited paper and the connection between the citing and cited paper. We present BACO, a BAckground knowledge- and COntent-based framework for citing sentence generation, which considers two types of information: (1) background knowledge by leveraging structural information from a citation network; and (2) content, which represents in-depth information about what to cite and why to cite. First, a citation network is encoded to provide background knowledge. Second, we apply salience estimation to identify what to cite by estimating the importance of sentences in the cited paper. During the decoding stage, both types of information are combined to facilitate the text generation, and then we conduct a joint training for the generator and citation function classification to make the model aware of why to cite. Our experimental results show that our framework outperforms comparative baselines.


Beyond Citations: Corpus-based Methods for Detecting the Impact of Research Outcomes on Society
Rezvaneh Rezapour | Jutta Bopp | Norman Fiedler | Diana Steffen | Andreas Witt | Jana Diesner
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper proposes, implements and evaluates a novel, corpus-based approach for identifying categories indicative of the impact of research via a deductive (top-down, from theory to data) and an inductive (bottom-up, from data to theory) approach. The resulting categorization schemes differ in substance. Research outcomes are typically assessed by using bibliometric methods, such as citation counts and patterns, or alternative metrics, such as references to research in the media. Shortcomings of these methods are their inability to identify the impact of research beyond academia (bibliometrics) and to consider text-based impact indicators beyond those that capture attention (altmetrics). We address these limitations by leveraging a mixed-methods approach for eliciting impact categories from experts, project personnel (deductive) and texts (inductive). Using these categories, we label a corpus of project reports per category schema, and apply supervised machine learning to infer these categories from project reports. The classification results show that we can predict deductively and inductively derived impact categories with 76.39% and 78.81% accuracy (F1-score), respectively. Our approach can complement solutions from bibliometrics and scientometrics for assessing the impact of research and studying the scope and types of advancements transferred from academia to society.

An Empirical Methodology for Detecting and Prioritizing Needs during Crisis Events
M. Janina Sarol | Ly Dinh | Rezvaneh Rezapour | Chieh-Li Chin | Pingjing Yang | Jana Diesner
Findings of the Association for Computational Linguistics: EMNLP 2020

In times of crisis, identifying essential needs is crucial to providing appropriate resources and services to affected entities. Social media platforms such as Twitter contain a vast amount of information about the general public’s needs. However, the sparsity of information and the amount of noisy content present a challenge for practitioners to effectively identify relevant information on these platforms. This study proposes two novel methods for two needs detection tasks: 1) extracting a list of needed resources, such as masks and ventilators, and 2) detecting sentences that specify who-needs-what resources (e.g., we need testing). We evaluate our methods on a set of tweets about the COVID-19 crisis. For extracting a list of needs, we compare our results against two official lists of resources, achieving 0.64 precision. For detecting who-needs-what sentences, we compared our results against a set of 1,000 annotated tweets and achieved a 0.68 F1-score.
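
For reference, the precision and F1 figures reported above follow the standard set-based definitions. The sketch below computes them for an extracted list of needs against a reference list. The helper name `precision_recall_f1` and the example items are hypothetical and do not reproduce the study's data or evaluation script.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str],
                        reference: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 of predicted items against a reference."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: an extracted list of needs vs. an official resource list.
extracted = ["masks", "ventilators", "coffee", "testing"]
official = ["masks", "ventilators", "testing", "gloves", "gowns"]
p, r, f1 = precision_recall_f1(extracted, official)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision alone suits the list-extraction task because an official list may be incomplete, so recall against it can understate what was actually found.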


REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning
Ming Jiang | Junjie Hu | Qiuyuan Huang | Lei Zhang | Jana Diesner | Jianfeng Gao
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Popular metrics used for evaluating image captioning systems, such as BLEU and CIDEr, provide a single score to gauge the system’s overall effectiveness. This score is often not informative enough to indicate what specific errors are made by a given system. In this study, we present a fine-grained evaluation method REO for automatically measuring the performance of image captioning systems. REO assesses the quality of captions from three perspectives: 1) Relevance to the ground truth, 2) Extraness of the content that is irrelevant to the ground truth, and 3) Omission of the elements in the images and human references. Experiments on three benchmark datasets demonstrate that our method achieves a higher consistency with human judgments and provides more intuitive evaluation results than alternative metrics.

TIGEr: Text-to-Image Grounding for Image Caption Evaluation
Ming Jiang | Qiuyuan Huang | Lei Zhang | Xin Wang | Pengchuan Zhang | Zhe Gan | Jana Diesner | Jianfeng Gao
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr evaluates caption quality based not only on how well a caption represents image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr has a higher consistency with human judgments than alternative existing metrics. We also comprehensively assess the metric’s effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.

A Constituency Parsing Tree based Method for Relation Extraction from Abstracts of Scholarly Publications
Ming Jiang | Jana Diesner
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

We present a simple, rule-based method for extracting entity networks from the abstracts of scientific literature. By taking advantage of selected syntactic features of constituency parsing trees, our method automatically extracts and constructs graphs in which nodes represent text-based entities (in this case, noun phrases) and edges represent their relationships (in this case, verb phrases or prepositional phrases). We use two benchmark datasets for evaluation and compare with previously presented results for these data. Our evaluation results show that the proposed method leads to accuracy rates that are comparable to or exceed the results achieved with state-of-the-art, learning-based methods in several cases.
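
To make the idea of rule-based extraction from a constituency tree concrete, the sketch below links a subject noun phrase to the first noun phrase inside the verb phrase and uses the intervening words as the relation. This single toy rule, the nested-tuple tree format, and the function names are assumptions for illustration only; they are not the rule set described in the paper.

```python
from typing import List, Tuple, Union

# A tree is either a leaf token (str) or a tuple: (label, child, child, ...).
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """Return the tokens under a tree node, left to right."""
    if isinstance(tree, str):
        return [tree]
    tokens: List[str] = []
    for child in tree[1:]:
        tokens.extend(leaves(child))
    return tokens


def extract_triples(tree: Tree) -> List[Tuple[str, str, str]]:
    """Toy rule: for each S with an NP and a VP child, emit
    (subject NP, words of the VP before its first NP, that NP)."""
    triples: List[Tuple[str, str, str]] = []
    if isinstance(tree, str):
        return triples
    label, children = tree[0], tree[1:]
    if label == "S":
        nps = [c for c in children if not isinstance(c, str) and c[0] == "NP"]
        vps = [c for c in children if not isinstance(c, str) and c[0] == "VP"]
        if nps and vps:
            subject = " ".join(leaves(nps[0]))
            relation: List[str] = []
            for c in vps[0][1:]:
                if not isinstance(c, str) and c[0] == "NP":
                    triples.append((subject, " ".join(relation), " ".join(leaves(c))))
                    break
                relation.extend(leaves(c))
    # Recurse so that embedded clauses are also covered.
    for child in children:
        triples.extend(extract_triples(child))
    return triples


# Hypothetical hand-written parse of "The method extracts entity networks".
parse = ("S",
         ("NP", ("DT", "The"), ("NN", "method")),
         ("VP", ("VBZ", "extracts"),
                ("NP", ("NN", "entity"), ("NNS", "networks"))))
triples = extract_triples(parse)
print(triples)
```

Each triple can be read as two nodes and a labeled edge, which is how a set of triples becomes an entity network.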

Enhancing the Measurement of Social Effects by Capturing Morality
Rezvaneh Rezapour | Saumil H. Shah | Jana Diesner
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We investigate the relationship between basic principles of human morality and the expression of opinions in user-generated text data. We assume that people’s backgrounds, culture, and values are associated with their perceptions and expressions of everyday topics, and that people’s language use reflects these perceptions. While personal values and social effects are abstract and complex concepts, they have practical implications and are relevant for a wide range of NLP applications. To extract human values (in this paper, morality) and measure social effects (morality and stance), we empirically evaluate the usage of a morality lexicon that we expanded via a quality-controlled, human-in-the-loop process. As a result, we enhanced the Moral Foundations Dictionary in size (from 324 to 4,636 syntactically disambiguated entries) and scope. We used both lexica for feature-based and deep learning classification (SVM, RF, and LSTM) to test their usefulness for measuring social effects. We find that the enhancement of the original lexicon led to measurable improvements in prediction accuracy for the selected NLP tasks.


Telling Apart Tweets Associated with Controversial versus Non-Controversial Topics
Aseel Addawood | Rezvaneh Rezapour | Omid Abdar | Jana Diesner
Proceedings of the Second Workshop on NLP and Computational Social Science

In this paper, we evaluate the predictability of tweets associated with controversial versus non-controversial topics. As a first step, we crowd-sourced the scoring of a predefined set of topics on a Likert scale from non-controversial to controversial. Our feature set entails and goes beyond sentiment features, e.g., by leveraging empathic language and other features that have been previously used but are new for this particular study. We find focusing on the structural characteristics of tweets to be beneficial for this task. Using a combination of emphatic, language-specific, and Twitter-specific features for supervised learning resulted in 87% accuracy (F1) for cross-validation of the training set and 63.4% accuracy when using the test set. Our analysis shows that features specific to Twitter or social media, in general, are more prevalent in tweets on controversial topics than in non-controversial ones. To test the premise of the paper, we conducted two additional sets of experiments, which led to mixed results. This finding will inform our future investigations into the relationship between language use on social media and the perceived controversiality of topics.


Semi-supervised Named Entity Recognition in noisy-text
Shubhanshu Mishra | Jana Diesner
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Many of the existing Named Entity Recognition (NER) solutions are built based on news corpus data with proper syntax. These solutions might not lead to highly accurate results when applied to noisy, user-generated data, e.g., tweets, which can feature sloppy spelling, concept drift, and limited contextualization of terms and concepts due to length constraints. The models described in this paper are based on linear chain conditional random fields (CRFs), use the BIEOU encoding scheme, and leverage random feature dropout for up-sampling the training data. The considered features include word clusters and pre-trained distributed word representations, updated gazetteer features, and global context predictions. The latter feature allows for ingesting the meaning of new or rare tokens into the system via unsupervised learning and for alleviating the need to learn lexicon-based features, which usually tend to be high dimensional. In this paper, we report on the solution [ST] we submitted to the WNUT 2016 NER shared task. We also present an improvement over our original submission [SI], which we built by using semi-supervised learning on labelled training data and pre-trained resources constructed from unlabelled tweet data. Our ST solution achieved an F1 score 1.2% higher than the baseline (35.1% F1) for the task of extracting 10 entity types. The SI resulted in an increase of 8.2% in F1 score over the baseline (7.08% over ST). Finally, the SI model’s evaluation on the test data achieved an F1 score of 47.3% (~1.15% increase over the 2nd best submitted solution). Our experimental setup and results are available as a standalone Twitter NER tool at https://github.com/napsternxg/TwitterNER.
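
As an illustration of the BIEOU encoding scheme mentioned above, the sketch below converts token-level entity spans into B (begin), I (inside), E (end), O (outside), and U (unit-length) tags. The function name `spans_to_bieou`, the span format with an exclusive end index, and the example tokens are assumptions made for illustration and are not taken from the TwitterNER code.

```python
from typing import List, Tuple


def spans_to_bieou(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans to BIEOU tags.

    Each span is (start, end, label) over token indices, with `end` exclusive
    (an assumed format). Spans are assumed not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


# Hypothetical example sentence with one multi-token and one single-token entity.
tokens = ["flying", "to", "New", "York", "City", "with", "Alice"]
spans = [(2, 5, "LOC"), (6, 7, "PER")]
tags = spans_to_bieou(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Compared with plain BIO tagging, the extra E and U tags mark entity boundaries explicitly, which gives a sequence model such as a CRF more label transitions to learn from.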

Says Who…? Identification of Expert versus Layman Critics’ Reviews of Documentary Films
Ming Jiang | Jana Diesner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We extend classic review mining work by building a binary classifier that predicts whether a review of a documentary film was written by an expert or a layman with 90.70% accuracy (F1 score), and compare the characteristics of the predicted classes. A variety of standard lexical and syntactic features was used for this supervised learning task. Our results suggest that experts write comparatively lengthier and more detailed reviews that feature more complex grammar and a higher diversity in their vocabulary. Layman reviews are more subjective and contextualized in people’s everyday lives. Our error analysis shows that laymen are about twice as likely to be mistaken for experts as vice versa. We argue that the type of author might be a useful new feature for improving the accuracy of predicting the rating, helpfulness and authenticity of reviews. Finally, the outcomes of this work might help researchers and practitioners in the field of impact assessment to gain a more fine-grained understanding of the perception of different types of media consumers and reviewers of a topic, genre or information product.


Unsupervised Construction of a Lexicon and a Repository of Variation Patterns for Arabic Modal Multiword Expressions
Rania Al-Sabbagh | Roxana Girju | Jana Diesner
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

Interactive Annotation for Event Modality in Modern Standard and Egyptian Arabic Tweets
Rania Al-Sabbagh | Roxana Girju | Jana Diesner
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

3arif: A Corpus of Modern Standard and Egyptian Arabic Tweets Annotated for Epistemic Modality Using Interactive Crowdsourcing
Rania Al-Sabbagh | Roxana Girju | Jana Diesner
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers


Using the Semantic-Syntactic Interface for Reliable Arabic Modality Annotation
Rania Al-Sabbagh | Jana Diesner | Roxana Girju
Proceedings of the Sixth International Joint Conference on Natural Language Processing