Kellie Webster


2023

pdf
Query Refinement Prompts for Closed-Book Long-Form QA
Reinald Kim Amplayo | Kellie Webster | Michael Collins | Dipanjan Das | Shashi Narayan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have been shown to perform well in answering questions and in producing long-form texts, both in few-shot closed-book settings. While the former can be validated using well-known evaluation metrics, the latter is difficult to evaluate. We resolve the difficulties to evaluate long-form output by doing both tasks at once – to do question answering that requires long-form answers. Such questions tend to be multifaceted, i.e., they may have ambiguities and/or require information from multiple sources. To this end, we define query refinement prompts that encourage LLMs to explicitly express the multifacetedness in questions and generate long-form answers covering multiple facets of the question. Our experiments on two long-form question answering datasets, ASQA and AQuAMuSe, show that using our prompts allows us to outperform fully finetuned models in the closed book setting, as well as achieve results comparable to retrieve-then-generate open-book models.

2022

pdf
Flexible text generation for counterfactual fairness probing
Zee Fryer | Vera Axelrod | Ben Packer | Alex Beutel | Jilin Chen | Kellie Webster
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

A common approach for testing fairness issues in text-based classifiers is through the use of counterfactuals: does the classifier output change if a sensitive attribute in the input is changed? Existing counterfactual generation methods typically rely on wordlists or templates, producing simple counterfactuals that fail to take into account grammar, context, or subtle sensitive attribute references, and could miss issues that the wordlist creators had not considered. In this paper, we introduce a task for generating counterfactuals that overcomes these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to accomplish this task. We show that this LLM-based method can produce complex counterfactuals that existing methods cannot, comparing the performance of various counterfactual generation methods on the Civil Comments dataset and showing their value in evaluating a toxicity classifier.

2021

pdf
Toward Deconfounding the Effect of Entity Demographics for Question Answering Accuracy
Maharshi Gor | Kellie Webster | Jordan Boyd-Graber
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The goal of question answering (QA) is to answer _any_ question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, an analysis of model accuracy reveals little evidence that accuracy is lower for people based on gender or nationality; instead, there is more variation on professions (question topic) and question ambiguity. But QA’s lack of representation could itself hide evidence of bias, necessitating QA datasets that better represent global diversity.

pdf bib
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing
Marta Costa-jussa | Hila Gonen | Christian Hardmeier | Kellie Webster
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing

2020

pdf
Social Biases in NLP Models as Barriers for Persons with Disabilities
Ben Hutchinson | Vinodkumar Prabhakaran | Emily Denton | Kellie Webster | Yu Zhong | Stephen Denuyl
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Building equitable and inclusive NLP technologies demands consideration of whether and how social attitudes are represented in ML models. In particular, representations encoded in models often inadvertently perpetuate undesirable social biases from the data on which they are trained. In this paper, we present evidence of such undesirable biases towards mentions of disability in two different English language models: toxicity prediction and sentiment analysis. Next, we demonstrate that the neural embeddings that are the critical first step in most NLP pipelines similarly contain undesirable biases towards mentions of disability. We end by highlighting topical biases in the discourse about disability which may contribute to the observed model biases; for instance, gun violence, homelessness, and drug addiction are over-represented in texts discussing mental illness.

pdf
Automatically Identifying Gender Issues in Machine Translation using Perturbations
Hila Gonen | Kellie Webster
Findings of the Association for Computational Linguistics: EMNLP 2020

The successful application of neural methods to machine translation has realized huge quality advances for the community. With these improvements, many have noted outstanding challenges, including the modeling and treatment of gendered language. While previous studies have identified issues using synthetic examples, we develop a novel technique to mine examples from real world data to explore challenges for deployed systems. We use our method to compile an evaluation benchmark spanning examples for four languages from three language families, which we publicly release to facilitate research. The examples in our benchmark expose where model representations are gendered, and the unintended consequences these gendered representations can have in downstream application.

pdf bib
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing
Marta R. Costa-jussà | Christian Hardmeier | Will Radford | Kellie Webster
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

pdf
Type B Reflexivization as an Unambiguous Testbed for Multilingual Multi-Task Gender Bias
Ana Valeria González | Maria Barrett | Rasmus Hvingelby | Kellie Webster | Anders Søgaard
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The one-sided focus on English in previous studies of gender bias in NLP misses out on opportunities in other languages: English challenge datasets such as GAP and WinoGender highlight model preferences that are “hallucinatory”, e.g., disambiguating gender-ambiguous occurrences of ‘doctor’ as male doctors. We show that for languages with type B reflexivization, e.g., Swedish and Russian, we can construct multi-task challenge datasets for detecting gender bias that lead to unambiguously wrong model predictions: In these languages, the direct translation of ‘the doctor removed his mask’ is not ambiguous between a coreferential reading and a disjoint reading. Instead, the coreferential reading requires a non-gendered pronoun, and the gendered, possessive pronouns are anti-reflexive. We present a multilingual, multi-task challenge dataset, which spans four languages and four NLP tasks and focuses only on this phenomenon. We find evidence for gender bias across all task-language combinations and correlate model bias with national labor market statistics.

2019

pdf bib
Proceedings of the First Workshop on Gender Bias in Natural Language Processing
Marta R. Costa-jussà | Christian Hardmeier | Will Radford | Kellie Webster
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

pdf bib
Gendered Ambiguous Pronoun (GAP) Shared Task at the Gender Bias in NLP Workshop 2019
Kellie Webster | Marta R. Costa-jussà | Christian Hardmeier | Will Radford
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

The 1st ACL workshop on Gender Bias in Natural Language Processing included a shared task on gendered ambiguous pronoun (GAP) resolution. This task was based on the coreference challenge defined in Webster et al. (2018), designed to benchmark the ability of systems to resolve pronouns in real-world contexts in a gender-fair way. 263 teams competed via a Kaggle competition, with the winning system achieving logloss of 0.13667 and near gender parity. We review the approaches of eleven systems with accepted description papers, noting their effective use of BERT (Devlin et al., 2018), both via fine-tuning and for feature extraction, as well as ensembling.

2018

pdf
Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns
Kellie Webster | Marta Recasens | Vera Axelrod | Jason Baldridge
Transactions of the Association for Computational Linguistics, Volume 6

Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines that demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.

pdf
A Challenge Set and Methods for Noun-Verb Ambiguity
Ali Elkahky | Kellie Webster | Daniel Andor | Emily Pitler
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

English part-of-speech taggers regularly make egregious errors related to noun-verb ambiguity, despite having achieved 97%+ accuracy on the WSJ Penn Treebank since 2002. These mistakes have been difficult to quantify and make taggers less useful to downstream tasks such as translation and text-to-speech synthesis. This paper creates a new dataset of over 30,000 naturally-occurring non-trivial examples of noun-verb ambiguity. Taggers within 1% of each other when measured on the WSJ have accuracies ranging from 57% to 75% accuracy on this challenge set. Enhancing the strongest existing tagger with contextual word embeddings and targeted training data improves its accuracy to 89%, a 14% absolute (52% relative) improvement. Downstream, using just this enhanced tagger yields a 28% reduction in error over the prior best learned model for homograph disambiguation for textto-speech synthesis.

2016

pdf
Using mention accessibility to improve coreference resolution
Kellie Webster | Joel Nothman
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

pdf bib
Proceedings of the Australasian Language Technology Association Workshop 2015
Ben Hachey | Kellie Webster
Proceedings of the Australasian Language Technology Association Workshop 2015

2014

pdf
Limited memory incremental coreference resolution
Kellie Webster | James R. Curran
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf
Examining the Impact of Coreference Resolution on Quote Attribution
Tim O’Keefe | Kellie Webster | James R. Curran | Irena Koprinska
Proceedings of the Australasian Language Technology Association Workshop 2013 (ALTA 2013)