Kuleen Sasse


2025

pdf bib
Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats
Kuleen Sasse | Carlos Alejandro Aguirre | Isabel Cachola | Sharon Levy | Mark Dredze
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dog whistles are coded expressions with dual meanings: one intended for the general public (outgroup) and another that conveys a specific message to an intended audience (ingroup). Often, these expressions are used to convey controversial political opinions while maintaining plausible deniability and slip by content moderation filters. Identification of dog whistles relies on curated lexicons, which have trouble keeping up to date. We introduce FETCH!, a task for finding novel dog whistles in massive social media corpora. We find that state-of-the-art systems fail to achieve meaningful results across three distinct social media case studies. We present EarShot, a strong baseline system that combines the strengths of vector databases and Large Language Models (LLMs) to efficiently and effectively identify new dog whistles.

2024

pdf bib
Selecting Shots for Demographic Fairness in Few-Shot Learning with Large Language Models
Carlos Alejandro Aguirre | Kuleen Sasse | Isabel Alyssa Cachola | Mark Dredze
Proceedings of the Third Workshop on NLP for Positive Impact

Recently, work in NLP has shifted to few-shot (in-context) learning, with large language models (LLMs) performing well across a range of tasks. However, while fairness evaluations have become a standard for supervised methods, little is known about the fairness of LLMs as prediction systems. Further, common standard methods for fairness involve access to model weights or are applied during finetuning, which are not applicable in few-shot learning. Do LLMs exhibit prediction biases when used for standard NLP tasks?In this work, we analyze the effect of shots, which directly affect the performance of models, on the fairness of LLMs as NLP classification systems. We consider how different shot selection strategies, both existing and new demographically sensitive methods, affect model fairness across three standard fairness datasets. We find that overall the performance of LLMs is not indicative of their fairness, and there is not a single method that fits all scenarios. In light of these facts, we discuss how future work can include LLM fairness in evaluations.

2023

pdf bib
To Burst or Not to Burst: Generating and Quantifying Improbable Text
Kuleen Sasse | Efsun Sarioglu Kayi | Samuel Barham | Edward Staley
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

While large language models (LLMs) are extremely capable at text generation, their outputs are still distinguishable from human-authored text. We explore this separation across many metrics over text, many sampling techniques, many types of text data, and across two popular LLMs, LLaMA and Vicuna. Along the way, we introduce a new metric, recoverability, to highlight differences between human and machine text; and we propose a new sampling technique, burst sampling, designed to close this gap. We find that LLaMA and Vicuna have distinct distributions under many of the metrics, and that this influences our results: Recoverability separates real from fake text better than any other metric when using LLaMA. When using Vicuna, burst sampling produces text which is distributionally closer to real text compared to other sampling techniques.