Derek Powell


2025

Evaluating Large Language Models for Belief Inference: Mapping Belief Networks at Scale
Trisevgeni Papakonstantinou | Antonina Zhiteneva | Ana Yutong Ma | Derek Powell | Zachary Horne
Findings of the Association for Computational Linguistics: EMNLP 2025

Beliefs are interconnected, influencing how people process and update what they think. To study the interconnectedness of beliefs at scale, we introduce a novel analytical pipeline leveraging a fine-tuned GPT-4o model to infer belief structures from large-scale social media data. We evaluate the model's performance by (1) comparing its annotations to human-annotated data and (2) comparing its inferences to human-generated survey data. Our results show that a fine-tuned GPT-4o model can effectively recover belief structures, allowing for a level of scalability and efficiency that is impossible using traditional survey methods of data collection. This work demonstrates the potential for large language models to perform belief inference tasks and provides a framework for future research on the analysis of belief structures.

2024

TAXI: Evaluating Categorical Knowledge Editing for Language Models
Derek Powell | Walter Gerych | Thomas Hartvigsen
Findings of the Association for Computational Linguistics: ACL 2024

Humans rarely learn one fact in isolation. Instead, learning a new fact induces knowledge of other facts about the world. For example, in learning that a korat is a type of cat, you also infer that it is a mammal and has claws, ensuring your model of the world is consistent. Knowledge editing aims to inject new facts into language models to improve their factuality, but current benchmarks fail to evaluate consistency, which is critical to ensure efficient, accurate, and generalizable edits. We manually construct TAXI, a new benchmark dataset specifically designed to evaluate consistency in categorical knowledge edits. TAXI contains 11,120 multiple-choice queries for 976 edits spanning 41 categories (e.g., Dogs), 164 subjects (e.g., Labrador), and 183 properties (e.g., is a mammal). We then use TAXI to evaluate popular editors' categorical consistency, measuring how often editing a subject's category appropriately edits its properties. We find that (1) the editors achieve marginal, yet non-random, consistency, (2) their consistency far underperforms human baselines, and (3) consistency is more achievable when editing atypical subjects.