Nazli Goharian


2022

ABNIRML: Analyzing the Behavior of Neural IR Models
Sean MacAvaney | Sergey Feldman | Nazli Goharian | Doug Downey | Arman Cohan
Transactions of the Association for Computational Linguistics, Volume 10

Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic probes that allow us to test several characteristics (such as writing styles, factuality, sensitivity to paraphrasing and word order) that are not addressed by previous techniques. To demonstrate the value of the framework, we conduct an extensive empirical study that yields insights into the factors that contribute to the neural model’s gains, and identify potential unintended biases the models exhibit. Some of our results confirm conventional wisdom, for example, that recent neural ranking models rely less on exact term overlap with the query, and instead leverage richer linguistic information, evidenced by their higher sensitivity to word and sentence order. Other results are more surprising, such as that some models (e.g., T5 and ColBERT) are biased towards factually correct (rather than simply relevant) texts. Further, some characteristics vary even for the same base language model, and other characteristics can appear due to random variations during model training.

Curriculum-guided Abstractive Summarization for Mental Health Online Posts
Sajad Sotudeh | Nazli Goharian | Hanieh Deilamsalehy | Franck Dernoncourt
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

Automatically generating short summaries from users’ online mental health posts could save counselors’ reading time and reduce their fatigue so that they can provide timely responses to those seeking help for improving their mental state. Recent Transformer-based summarization models have presented a promising approach to abstractive summarization. They go beyond sentence selection and extractive strategies to deal with more complicated tasks such as novel word generation and sentence paraphrasing. Nonetheless, these models have a prominent shortcoming: their training strategy is not very efficient, which restricts the model’s performance. In this paper, we include a curriculum learning approach to reweigh the training samples, bringing about an efficient learning procedure. We apply our model to the extreme summarization dataset of MentSum posts, a dataset of mental health related posts from Reddit. Compared to the state-of-the-art model, our proposed method makes substantial gains in terms of Rouge and Bertscore evaluation metrics, yielding relative improvements of 3.5% Rouge-1, 10.4% Rouge-2, 4.7% Rouge-L, and 1.5% Bertscore.

TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation
Sajad Sotudeh | Nazli Goharian
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Many scientific papers such as those in arXiv and PubMed data collections have abstracts with varying lengths of 50-1000 words and an average length of approximately 200 words, where longer abstracts typically convey more information about the source paper. Until recently, scientific summarization research has typically focused on generating short, abstract-like summaries following the existing datasets used for scientific summarization. In domains where the source text is relatively long-form, such as in scientific documents, such a summary cannot go beyond a general and coarse overview and provide salient information from the source document. Recent interest in tackling this problem motivated the curation of scientific datasets, arXiv-Long and PubMed-Long, containing human-written summaries of 400-600 words, hence providing a venue for research in generating long/extended summaries. Extended summaries facilitate a faster read while providing details beyond coarse information. In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information. The evaluations on two existing large-scale extended summarization datasets indicate statistically significant improvement in terms of Rouge and average Rouge (F1) scores (except in one case) as compared to strong baselines and state-of-the-art. Comprehensive human evaluations favor our generated extended summaries in terms of cohesion and completeness.

GUIR @ MuP 2022: Towards Generating Topic-aware Multi-perspective Summaries for Scientific Documents
Sajad Sotudeh | Nazli Goharian
Proceedings of the Third Workshop on Scholarly Document Processing

This paper presents our approach for the MuP 2022 shared task (Multi-Perspective Scientific Document Summarization), where the objective is to enable summarization models to explore methods for generating multi-perspective summaries for scientific papers. We explore two orthogonal ways to cope with this task. The first approach involves incorporating a neural topic model (i.e., NTM) into the state-of-the-art abstractive summarizer (LED); the second approach involves adding a two-step summarizer that extracts the salient sentences from the document and then writes abstractive summaries from those sentences. Our latter model outperformed our other submissions on the official test set. Specifically, among 10 participants (including the organizers’ baseline) who made their results public with 163 total runs, our best system ranks first in Rouge-1 (F), and second in Rouge-1 (R), Rouge-2 (F) and Average Rouge (F) scores.

TBD3: A Thresholding-Based Dynamic Depression Detection from Social Media for Low-Resource Users
Hrishikesh Kulkarni | Sean MacAvaney | Nazli Goharian | Ophir Frieder
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Social media are heavily used by many users to share their mental health concerns and diagnoses. This trend has turned social media into a large-scale resource for researchers focused on detecting mental health conditions. Social media usage varies considerably across individuals. Thus, classification of patterns, including detecting signs of depression, must account for such variation. We address the disparity in classification effectiveness for users with little activity (e.g., new users). Our evaluation, performed on a large-scale dataset, shows considerable detection discrepancy based on user posting frequency. For instance, the F1 detection score of users with an above-median versus below-median number of posts is greater than double (0.803 vs 0.365) using a conventional CNN-based model; similar results were observed on lexical and transformer-based classifiers. To complement this evaluation, we propose a dynamic thresholding technique that adjusts the classifier’s sensitivity as a function of the number of posts a user has. This technique alone reduces the margin between users with many and few posts, on average, by 45% across all methods and increases overall performance, on average, by 33%. These findings emphasize the importance of evaluating and tuning natural language systems for potentially vulnerable populations.
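
The idea of a threshold that depends on a user's post count can be illustrated with a small sketch. The abstract does not give the functional form, so everything below is an assumption made for illustration: the linear interpolation, the names `dynamic_threshold` and `is_flagged`, and the values `low`, `high`, and `saturation` are hypothetical and are not taken from the paper.

```python
def dynamic_threshold(num_posts: int, low: float = 0.3, high: float = 0.5,
                      saturation: int = 100) -> float:
    """Hypothetical threshold that grows linearly with a user's post count.

    Users with few posts get a lower threshold (a more sensitive classifier);
    the threshold stops growing once `saturation` posts are reached.
    All default values are illustrative assumptions.
    """
    if num_posts < 0:
        raise ValueError("num_posts must be non-negative")
    frac = min(num_posts, saturation) / saturation
    return low + (high - low) * frac


def is_flagged(score: float, num_posts: int) -> bool:
    """Flag a user when the classifier score reaches the per-user threshold."""
    return score >= dynamic_threshold(num_posts)


# The same classifier score leads to different decisions for a new user
# and for a highly active user.
print(is_flagged(0.4, num_posts=5))
print(is_flagged(0.4, num_posts=100))
```

In this sketch only the decision rule changes with activity level; the underlying classifier score is left untouched.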

MentSum: A Resource for Exploring Summarization of Mental Health Online Posts
Sajad Sotudeh | Nazli Goharian | Zachary Young
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Mental health remains a significant challenge of public health worldwide. With the increasing popularity of online platforms, many use the platforms to share their mental health conditions, express their feelings, and seek help from the community and counselors. Some of these platforms, such as Reachout, are dedicated forums where the users register to seek help. Others such as Reddit provide subreddits where the users publicly but anonymously post their mental health distress. Although posts are of varying length, it is beneficial to provide a short but informative summary for fast processing by the counselors. To facilitate research in summarization of mental health online posts, we introduce the Mental Health Summarization dataset, MentSum, containing over 24k carefully selected user posts from Reddit, along with their short user-written summaries (called TLDR) in English from 43 mental health subreddits. This domain-specific dataset could be of interest not only for generating short summaries on Reddit, but also for generating summaries of posts on the dedicated mental health forums such as Reachout. We further evaluate both extractive and abstractive state-of-the-art summarization baselines in terms of Rouge scores, and finally conduct an in-depth human evaluation study of both user-written and system-generated summaries, highlighting challenges in this research.

2021

Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access
Nazli Goharian | Philip Resnik | Andrew Yates | Molly Ireland | Kate Niederhoffer | Rebecca Resnik
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

ToxCCIn: Toxic Content Classification with Interpretability
Tong Xiang | Sean MacAvaney | Eugene Yang | Nazli Goharian
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Despite the recent successes of transformer-based models in terms of effectiveness on a variety of tasks, their decisions often remain opaque to humans. Explanations are particularly important for tasks like offensive language or toxicity detection on social media because a manual appeal process is often in place to dispute automatically flagged content. In this work, we propose a technique to improve the interpretability of these models, based on a simple and powerful assumption: a post is at least as toxic as its most toxic span. We incorporate this assumption into transformer models by scoring a post based on the maximum toxicity of its spans and augmenting the training process to identify correct spans. We find that this approach is effective and can produce explanations that exceed the quality of those provided by Logistic Regression analysis (often regarded as a highly-interpretable model), according to a human study.
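
The assumption that a post is at least as toxic as its most toxic span amounts to max-pooling over span scores. The following minimal sketch shows only that scoring step. The span scores are hypothetical stand-ins for model outputs, and the function name `post_toxicity` is an assumption; the transformer model and the augmented training described in the abstract are not reproduced here.

```python
from typing import Sequence, Tuple


def post_toxicity(span_scores: Sequence[float]) -> Tuple[float, int]:
    """Score a post as the maximum toxicity over its spans.

    Returns the post-level score and the index of the span that produced it,
    which can serve as the explanation for the decision.
    """
    if not span_scores:
        raise ValueError("need at least one span score")
    best = max(range(len(span_scores)), key=lambda i: span_scores[i])
    return span_scores[best], best


# Hypothetical per-span toxicity scores for a single post.
scores = [0.05, 0.10, 0.92, 0.30]
score, span_index = post_toxicity(scores)
print(f"post score = {score}, most toxic span = {span_index}")
```

Returning the index alongside the score is what makes the decision inspectable: the flagged span can be shown to a reviewer.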

TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts
Sajad Sotudeh | Hanieh Deilamsalehy | Franck Dernoncourt | Nazli Goharian
Proceedings of the Third Workshop on New Frontiers in Summarization

Recent models in developing summarization systems consist of millions of parameters and the model performance is highly dependent on the abundance of training data. While most existing summarization corpora contain data on the order of thousands to one million instances, the generation of large-scale summarization datasets on the order of a couple of million instances is yet to be explored. Practically, more data is better at generalizing the training patterns to unseen data. In this paper, we introduce TLDR9+, a large-scale summarization dataset containing over 9 million training instances extracted from the Reddit discussion forum ([HTTP]). This dataset is specifically gathered to perform extreme summarization (i.e., generating a one-sentence summary with high compression and abstraction) and is more than twice as large as the previously proposed dataset. We go one step further and, with the help of human annotations, distill a more fine-grained dataset by sampling High-Quality instances from TLDR9+, which we call the TLDRHQ dataset. We further pinpoint different state-of-the-art summarization models on our proposed datasets.

2020

Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization
Sajad Sotudeh Gharebagh | Nazli Goharian | Ross Filice
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The sequence-to-sequence (seq2seq) network is a well-established model for the text summarization task. It can learn to produce readable content; however, it falls short in effectively identifying key regions of the source. In this paper, we approach the content selection problem for clinical abstractive summarization by augmenting salient ontological terms into the summarizer. Our experiments on two publicly available clinical data sets (107,372 reports of MIMIC-CXR, and 3,366 reports of OpenI) show that our model statistically significantly boosts state-of-the-art results in terms of ROUGE metrics (with improvements: 2.9% RG-1, 2.5% RG-2, 1.9% RG-L), in the healthcare domain where any range of improvement impacts patients’ welfare.

SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search
Sean MacAvaney | Arman Cohan | Nazli Goharian
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

With worldwide concerns surrounding the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), there is a rapidly growing body of scientific literature on the virus. Clinicians, researchers, and policy-makers need to be able to search these articles effectively. In this work, we present a zero-shot ranking algorithm that adapts to COVID-related scientific literature. Our approach filters training data from another collection down to medical-related queries, uses a neural re-ranking model pre-trained on scientific text (SciBERT), and filters the target document collection. This approach ranks top among zero-shot methods on the TREC COVID Round 1 leaderboard, and exhibits a P@5 of 0.80 and an nDCG@10 of 0.68 when evaluated on both Round 1 and 2 judgments. Despite not relying on TREC-COVID data, our method outperforms models that do. As one of the first search methods to thoroughly evaluate COVID-19 search, we hope that this serves as a strong baseline and helps in the global crisis.

GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents
Sajad Sotudeh Gharebagh | Arman Cohan | Nazli Goharian
Proceedings of the First Workshop on Scholarly Document Processing

This paper presents our methods for the LongSumm 2020: Shared Task on Generating Long Summaries for Scientific Documents, where the task is to generate long summaries given a set of scientific papers provided by the organizers. We explore 3 main approaches for this task: 1. An extractive approach using a BERT-based summarization model; 2. A two stage model that additionally includes an abstraction step using BART; and 3. A new multi-tasking approach on incorporating document structure into the summarizer. We found that our new multi-tasking approach outperforms the two other methods by large margins. Among 9 participants in the shared task, our best model ranks top according to Rouge-1 score (53.11%) while staying competitive in terms of Rouge-2.

Team DoNotDistribute at SemEval-2020 Task 11: Features, Finetuning, and Data Augmentation in Neural Models for Propaganda Detection in News Articles
Michael Kranzlein | Shabnam Behzad | Nazli Goharian
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents our systems for SemEval 2020 Shared Task 11: Detection of Propaganda Techniques in News Articles. We participate in both the span identification and technique classification subtasks and report on experiments using different BERT-based models along with handcrafted features. Our models perform well above the baselines for both tasks, and we contribute ablation studies and discussion of our results to dissect the effectiveness of different features and techniques with the goal of aiding future studies in propaganda detection.

GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language Detection
Sajad Sotudeh | Tong Xiang | Hao-Ren Yao | Sean MacAvaney | Eugene Yang | Nazli Goharian | Ophir Frieder
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of a target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our experiments explore using a domain-tuned contextualized language model (namely, BERT) for this task. We also experiment with different components and configurations (e.g., a multi-view SVM) stacked upon BERT models for specific sub-tasks. Our submissions achieve F1 scores of 91.7% in Sub-task A, 66.5% in Sub-task B, and 63.2% in Sub-task C. We perform an ablation study which reveals that domain tuning considerably improves the classification performance. Furthermore, error analysis shows common misclassification errors made by our model and outlines directions for future research.

2018

RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses
Sean MacAvaney | Bart Desmet | Arman Cohan | Luca Soldaini | Andrew Yates | Ayah Zirikly | Nazli Goharian
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Self-reported diagnosis statements have been widely employed in studying language related to mental health in social media. However, existing research has largely ignored the temporality of mental health diagnoses. In this work, we introduce RSDD-Time: a new dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Furthermore, we include exact temporal spans that relate to the date of diagnosis. This information is valuable for various computational methods to examine mental health through social media because one’s mental health state is not static. We also test several baseline classification and extraction approaches, which suggest that extracting temporal information from self-reported diagnosis statements is challenging.

Helping or Hurting? Predicting Changes in Users’ Risk of Self-Harm Through Online Community Interactions
Luca Soldaini | Timothy Walsh | Arman Cohan | Julien Han | Nazli Goharian
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

In recent years, online communities have formed around suicide and self-harm prevention. While these communities offer support in moments of crisis, they can also normalize harmful behavior, discourage professional treatment, and instigate suicidal ideation. In this work, we focus on how interaction with others in such a community affects the mental state of users who are seeking support. We first build a dataset of conversation threads between users in a distressed state and community members offering support. We then show how to construct a classifier to predict whether distressed users are helped or harmed by the interactions in the thread, and we achieve a macro-F1 score of up to 0.69.

GU IRLAB at SemEval-2018 Task 7: Tree-LSTMs for Scientific Relation Classification
Sean MacAvaney | Luca Soldaini | Arman Cohan | Nazli Goharian
Proceedings of the 12th International Workshop on Semantic Evaluation

SemEval 2018 Task 7 focuses on relation extraction and classification in scientific literature. In this work, we present our tree-based LSTM network for this shared task. Our approach placed 9th (of 28) for subtask 1.1 (relation classification), and 5th (of 20) for subtask 1.2 (relation classification with noisy entities). We also provide an ablation study of features included as input to the network.

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
Arman Cohan | Franck Dernoncourt | Doo Soon Kim | Trung Bui | Seokhwan Kim | Walter Chang | Nazli Goharian
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.

SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions
Arman Cohan | Bart Desmet | Andrew Yates | Luca Soldaini | Sean MacAvaney | Nazli Goharian
Proceedings of the 27th International Conference on Computational Linguistics

Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without the need for manual labelling. We introduce the SMHD (Self-reported Mental Health Diagnoses) dataset and make it available. SMHD is a novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users. We examine distinctions in users’ language, as measured by linguistic and psychological variables. We further explore text classification methods to identify individuals with mental conditions through their language.

2017

Depression and Self-Harm Risk Assessment in Online Forums
Andrew Yates | Arman Cohan | Nazli Goharian
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Users suffering from mental health conditions often turn to online resources for support, including specialized online support communities or general communities such as Twitter and Reddit. In this work, we present a framework for supporting and studying users in both types of communities. We propose methods for identifying posts in support communities that may indicate a risk of self-harm, and demonstrate that our approach outperforms strong previously proposed methods for identifying such posts. Self-harm is closely related to depression, which makes identifying depressed users on general forums a crucial related task. We introduce a large-scale general forum dataset consisting of users with self-reported depression diagnoses matched with control users. We show how our method can be applied to effectively identify depressed users from their use of language alone. We demonstrate that our method outperforms strong baselines on this general forum dataset.

GUIR at SemEval-2017 Task 12: A Framework for Cross-Domain Clinical Temporal Information Extraction
Sean MacAvaney | Arman Cohan | Nazli Goharian
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Clinical TempEval 2017 (SemEval 2017 Task 12) addresses the task of cross-domain temporal extraction from clinical text. We present a system for this task that uses supervised learning for the extraction of temporal expression and event spans with corresponding attributes and narrative container relations. Approaches include conditional random fields and decision tree ensembles, using lexical, syntactic, semantic, distributional, and rule-based features. Our system received best or second best scores in TIMEX3 span, EVENT span, and CONTAINS relation extraction.

2016

GUIR at SemEval-2016 task 12: Temporal Information Processing for Clinical Narratives
Arman Cohan | Kevin Meurer | Nazli Goharian
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Revisiting Summarization Evaluation for Scientific Articles
Arman Cohan | Nazli Goharian
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Evaluation of text summarization approaches has been mostly based on metrics that measure similarities of system generated summaries with a set of human written gold-standard summaries. The most widely used metric in summarization evaluation has been the ROUGE family. ROUGE solely relies on lexical overlaps between the terms and phrases in the sentences; therefore, in cases of terminology variations and paraphrasing, ROUGE is not as effective. Scientific article summarization is one such case that is different from general domain summarization (e.g. newswire data). We provide an extensive analysis of ROUGE’s effectiveness as an evaluation metric for scientific summarization; we show that, contrary to the common belief, ROUGE is not very reliable in evaluating scientific summaries. We furthermore show how different variants of ROUGE result in very different correlations with the manual Pyramid scores. Finally, we propose an alternative metric for summarization evaluation which is based on the content relevance between a system generated summary and the corresponding human written summaries. We call our metric SERA (Summarization Evaluation by Relevance Analysis). Unlike ROUGE, SERA consistently achieves high correlations with manual scores, which shows its effectiveness in evaluation of scientific article summarization.
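
The point about lexical overlap can be made concrete with a simplified unigram-recall computation in the spirit of ROUGE-1. This is a sketch under stated assumptions, not the official ROUGE toolkit: it uses lowercased whitespace tokenization with no stemming or stopword handling, and the function name `rouge1_recall` and the example sentences are invented for illustration.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 recall: clipped unigram overlap / reference length."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    return overlap / sum(ref.values())


reference = "the drug reduces tumor growth"
verbatim = "the drug reduces tumor growth"
paraphrase = "the medication inhibits cancer progression"

print(rouge1_recall(verbatim, reference))
print(rouge1_recall(paraphrase, reference))
```

The paraphrase conveys roughly the same content as the reference but shares only the token "the" with it, so its overlap-based score is low. This is the kind of terminology variation the abstract describes.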

Effects of Sampling on Twitter Trend Detection
Andrew Yates | Alek Kolcz | Nazli Goharian | Ophir Frieder
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Much research has focused on detecting trends on Twitter, including health-related trends such as mentions of Influenza-like illnesses or their symptoms. The majority of this research has been conducted using Twitter’s public feed, which includes only about 1% of all public tweets. It is unclear if, when, and how using Twitter’s 1% feed has affected the evaluation of trend detection methods. In this work we use a larger feed to investigate the effects of sampling on Twitter trend detection. We focus on using health-related trends to estimate the prevalence of Influenza-like illnesses based on tweets. We use ground truth obtained from the CDC and Google Flu Trends to explore how the prevalence estimates degrade when moving from a 100% to a 1% sample. We find that using the 1% sample is unlikely to substantially harm ILI estimates made at the national level, but can cause poor performance when estimates are made at the city level.

Triaging Mental Health Forum Posts
Arman Cohan | Sydney Young | Nazli Goharian
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

2015

Scientific Article Summarization Using Citation-Context and Article’s Discourse Structure
Arman Cohan | Nazli Goharian
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Matching Citation Text and Cited Spans in Biomedical Literature: a Search-Oriented Approach
Arman Cohan | Luca Soldaini | Nazli Goharian
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

A Framework for Public Health Surveillance
Andrew Yates | Jon Parker | Nazli Goharian | Ophir Frieder
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

With the rapid growth of social media, there is increasing potential to augment traditional public health surveillance methods with data from social media. We describe a framework for performing public health surveillance on Twitter data. Our framework, which is publicly available, consists of three components that work together to detect health-related trends in social media: a concept extraction component for identifying health-related concepts, a concept aggregation component for identifying how the extracted health-related concepts relate to each other, and a trend detection component for determining when the aggregated health-related concepts are trending. We describe the architecture of the framework and several components that have been implemented in the framework, identify other components that could be used with the framework, and evaluate our framework on approximately 1.5 years of tweets. While it is difficult to determine how accurately a Twitter trend reflects a trend in the real world, we discuss the differences in trends detected by several different methods and compare flu trends detected by our framework to data from Google Flu Trends.