Blase Ur

2025

pdf bib abs
Implicit Values Embedded in How Humans and LLMs Complete Subjective Everyday Tasks
Arjun Arunasalam | Madison Pickering | Z. Berkay Celik | Blase Ur
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) can underpin AI assistants that help users with everyday tasks, such as by making recommendations or performing basic computation. Despite AI assistants’ promise, little is known about the implicit values these assistants display while completing subjective everyday tasks. Humans may consider values like environmentalism, charity, and diversity. To what extent do LLMs exhibit these values in completing everyday tasks? How do they compare with humans? We answer these questions by auditing how six popular LLMs complete 30 everyday tasks, comparing LLMs to each other and to 100 human crowdworkers from the US. We find LLMs often do not align with humans, nor with other LLMs, in the implicit values exhibited.

2022

pdf bib abs
Explaining Why: How Instructions and User Interfaces Impact Annotator Rationales When Labeling Text Data
Jamar Sullivan Jr. | Will Brackenbury | Andrew McNutt | Kevin Bryson | Kwam Byll | Yuxin Chen | Michael Littman | Chenhao Tan | Blase Ur
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In the context of data labeling, NLP researchers are increasingly interested in having humans select rationales, a subset of input tokens relevant to the chosen label. We conducted a 332-participant online user study to understand how humans select rationales, especially how different instructions and user interface affordances impact the rationales chosen. Participants labeled ten movie reviews as positive or negative, selecting words and phrases supporting their label as rationales. We varied the instructions given, the rationale-selection task, and the user interface. Participants often selected about 12% of input tokens as rationales, but selected fewer if unable to drag over multiple tokens at once. Whereas participants were near unanimous in their data labels, they were far less consistent in their rationales. The user interface affordances and task greatly impacted the types of rationales chosen. We also observed large variance across participants.

Co-authors

Venues

emnlp1
naacl1

Fix author