Sharon Levy


Addressing Issues of Cross-Linguality in Open-Retrieval Question Answering Systems For Emergent Domains
Alon Albalak | Sharon Levy | William Yang Wang
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Open-retrieval question answering systems are generally trained and tested on large datasets in well-established domains. However, low-resource settings such as new and emerging domains would especially benefit from reliable question answering systems. Furthermore, multilingual and cross-lingual resources in emergent domains are scarce, leading to few or no such systems. In this paper, we demonstrate a cross-lingual open-retrieval question answering system for the emergent domain of COVID-19. Our system adopts a corpus of scientific articles to ensure that retrieved documents are reliable. To address the scarcity of cross-lingual training data in emergent domains, we present a method utilizing automatic translation, alignment, and filtering to produce English-to-all datasets. We show that a deep semantic retriever greatly benefits from training on our English-to-all data and significantly outperforms a BM25 baseline in the cross-lingual setting. We illustrate the capabilities of our system with examples and release all code necessary to train and deploy such a system.


Towards Understanding Gender-Seniority Compound Bias in Natural Language Generation
Samhita Honnavalli | Aesha Parekh | Lily Ou | Sophie Groenwold | Sharon Levy | Vicente Ordonez | William Yang Wang
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Women are often perceived as junior to their male counterparts, even within the same job titles. While there has been significant progress in the evaluation of gender bias in natural language processing (NLP), existing studies seldom investigate how biases toward gender groups change when compounded with other societal biases. In this work, we investigate how seniority impacts the degree of gender bias exhibited in pretrained neural generation models by introducing a novel framework for probing compound bias. We contribute a benchmark robustness-testing dataset spanning two domains, U.S. senatorship and professorship, created using a distant-supervision method. Our dataset includes human-written text with underlying ground truth and paired counterfactuals. We then examine GPT-2 perplexity and the frequency of gendered language in generated text. Our results show that GPT-2 amplifies bias by considering women as junior and men as senior more often than the ground truth in both domains. These results suggest that NLP applications built using GPT-2 may harm women in professional capacities.

HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data
Kai Nakamura | Sharon Levy | Yi-Lin Tuan | Wenhu Chen | William Yang Wang
Findings of the Association for Computational Linguistics: ACL 2022

A pressing challenge in current dialogue systems is to successfully converse with users on topics with information distributed across different modalities. Previous work in multiturn dialogue systems has primarily focused on either text or table information. In more realistic scenarios, having a joint understanding of both is critical as knowledge is typically distributed over both unstructured and structured forms. We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables. The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions. We propose retrieval, system state tracking, and dialogue response generation tasks for our dataset and conduct baseline experiments for each. Our results show that there is still ample opportunity for improvement, demonstrating the importance of building stronger dialogue systems that can reason over the complex setting of information-seeking dialogue grounded on tables and text.

Mitigating Covertly Unsafe Text within Natural Language Systems
Alex Mei | Anisha Kabir | Sharon Levy | Melanie Subbiah | Emily Allaway | John Judge | Desmond Patton | Bruce Bimber | Kathleen McKeown | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2022

An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system’s information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

SafeText: A Benchmark for Exploring Physical Safety in Language Models
Sharon Levy | Emily Allaway | Melanie Subbiah | Lydia Chilton | Desmond Patton | Kathleen McKeown | William Yang Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.


Investigating Memorization of Conspiracy Theories in Text Generation
Sharon Levy | Michael Saxon | William Yang Wang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Modeling Disclosive Transparency in NLP Application Descriptions
Michael Saxon | Sharon Levy | Xinyi Wang | Alon Albalak | William Yang Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Broader disclosive transparency—truth and clarity in communication regarding the function of AI systems—is widely considered desirable. Unfortunately, it is a nebulous concept, difficult to both define and quantify. This is problematic, as previous work has demonstrated possible trade-offs and negative consequences to disclosive transparency, such as a confusion effect, where “too much information” clouds a reader’s understanding of what a system description means. Disclosive transparency’s subjective nature has rendered deep study into these problems and their remedies difficult. To improve this state of affairs, we introduce neural language model-based probabilistic metrics to directly model disclosive transparency, and demonstrate that they correlate with user and expert opinions of system transparency, making them a valid objective proxy. Finally, we demonstrate the use of these metrics in a pilot study quantifying the relationships between transparency, confusion, and user perceptions in a corpus of real NLP system descriptions.

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains
Sharon Levy | Kevin Mo | Wenhan Xiong | William Yang Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Since late 2019, COVID-19 has quickly emerged as the newest biomedical domain, resulting in a surge of new information. As with other emergent domains, the discussion surrounding the topic has been rapidly changing, leading to the spread of misinformation. This has created the need for a public space for users to ask questions and receive credible, scientific answers. To fulfill this need, we turn to the task of open-domain question-answering, which we can use to efficiently find answers to free-text questions from a large set of documents. In this work, we present such a system for the emergent domain of COVID-19. Despite the small data size available, we are able to successfully train the system to retrieve answers from a large-scale corpus of published COVID-19 scientific papers. Furthermore, we incorporate effective re-ranking and question-answering techniques, such as document diversity and multiple answer spans. Our open-domain question-answering system can further act as a model for the quick development of similar systems that can be adapted and modified for other developing emergent domains.


Cross-lingual Transfer Learning for COVID-19 Outbreak Alignment
Sharon Levy | William Yang Wang
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

The spread of COVID-19 has become a significant and troubling aspect of society in 2020. With millions of cases reported across countries, new outbreaks have occurred and followed patterns of previously affected areas. Many disease detection models do not incorporate the wealth of social media data that can be utilized for modeling and predicting its spread. It is useful to ask, can we utilize this knowledge in one country to model the outbreak in another? To answer this, we propose the task of cross-lingual transfer learning for epidemiological alignment. Utilizing both macro and micro text features, we train on Italy’s early COVID-19 outbreak through Twitter and transfer to several other countries. Our experiments show strong results with up to 0.85 Spearman correlation in cross-country predictions.

Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
Kai Nakamura | Sharon Levy | William Yang Wang
Proceedings of the Twelfth Language Resources and Evaluation Conference

Fake news has altered society in negative ways in politics and culture. It has adversely affected both online social network systems as well as offline communities and conversations. Using automatic machine learning classification models is an efficient way to combat the widespread dissemination of fake news. However, a lack of effective, comprehensive datasets has been a problem for fake news research and detection model development. Prior fake news datasets do not provide multimodal text and image data, metadata, comment data, and fine-grained fake news categorization at the scale and breadth of our dataset. We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. We construct hybrid text+image models and perform extensive experiments for multiple variations of classification, demonstrating the importance of the novel aspect of multimodality and fine-grained classification unique to Fakeddit.

Investigating African-American Vernacular English in Transformer-Based Text Generation
Sophie Groenwold | Lily Ou | Aesha Parekh | Samhita Honnavalli | Sharon Levy | Diba Mirza | William Yang Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The growth of social media has encouraged the written use of African American Vernacular English (AAVE), which has traditionally been used only in oral contexts. However, NLP models have historically been developed using dominant English varieties, such as Standard American English (SAE), due to text corpora availability. We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs, thereby isolating syntactic structure and AAVE- or SAE-specific language for each pair. We evaluate each sample and its GPT-2 generated text with pretrained sentiment classifiers and find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both. Additionally, we conduct human evaluation of AAVE and SAE text generated with GPT-2 to compare contextual rigor and overall quality.