Alexander Miserlis Hoyle

2025

pdf bib abs
How Persuasive Is Your Context?
Tu Nguyen | Kevin Du | Alexander Miserlis Hoyle | Ryan Cotterell
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Two central capabilities of language models (LMs) are: (i) drawing on prior knowledge about entities, which allows them to answer queries such as What’s the official language of Austria?, and (ii) adapting to new information provided in context, e.g., Pretend the official language of Austria is Tagalog., that is pre-pended to the question. In this article, we introduce targeted persuasion score (TPS), designed to quantify how persuasive a given context is to an LM where persuasion is operationalized as the ability of the context to alter the LM’s answer to the question. In contrast to evaluating persuasiveness only through a model’s most likely answer, TPS provides a more fine-grained view of model behavior. Based on the Wasserstein distance, TPS measures how much a context shifts a model’s original answer distribution towarda target distribution. Empirically, through aseries of experiments, we show that TPS captures a more nuanced notion of persuasiveness than previously proposed metrics.

Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex”, but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study finds that pairwise comparisons made by LLMs produce better measurements than simply prompting the LLM to directly output the scores, which suffers from bunching around arbitrary numbers. However, taking the weighted mean over the token probability of scores further improves the measurements over the two previous approaches. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.

Co-authors

Venues

emnlp2

Fix author