Amélie Reymond
2026
Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
Saloni Dash | Amélie Reymond | Emma Spiro | Aylin Caliskan
Findings of the Association for Computational Linguistics: ACL 2026
Reasoning in humans is prone to biases due to underlying motivations, such as identity protection, that undermine rational decision-making and judgment. At a collective level, this motivated reasoning can be detrimental to society when debating critical issues such as human-driven climate change or vaccine safety, and can further aggravate political polarization. Prior studies have reported that large language models (LLMs) are also susceptible to human-like cognitive biases; however, the extent to which LLMs selectively reason toward identity-congruent conclusions remains largely unexplored. Here, we investigate whether assigning 8 personas across 4 political and socio-demographic attributes induces motivated reasoning in LLMs. Testing 8 LLMs (open source and proprietary) across two reasoning tasks from human-subject studies (veracity discernment of misinformation headlines and evaluation of numeric scientific evidence), we find that persona-assigned LLMs have up to 9% reduced veracity discernment relative to models without personas. Political personas specifically are up to 90% more likely to correctly evaluate scientific evidence on gun control when the ground truth is congruent with their induced political identity. Prompt-based debiasing methods are largely ineffective at mitigating these effects. Taken together, our empirical findings are the first to suggest that persona-assigned LLMs exhibit human-like motivated reasoning that is hard to mitigate through conventional debiasing prompts, raising concerns of exacerbating identity-congruent reasoning in both LLMs and humans.
2025
Quriosity: Analyzing Human Questioning Behavior and Causal Inquiry through Curiosity-Driven Queries
Roberto Ceraolo | Dmitrii Kharlapenko | Ahmad Khan | Amélie Reymond | Rada Mihalcea | Bernhard Schölkopf | Mrinmaya Sachan | Zhijing Jin
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Recent progress in Large Language Model (LLM) technology has changed our role in interacting with these models. Instead of primarily testing these models with questions we already know answers to, we are now using them for queries where the answers are unknown to us, driven by human curiosity. This shift highlights the growing need to understand curiosity-driven human questions – those that are more complex, open-ended, and reflective of real-world needs. To this end, we present Quriosity, a collection of 13K naturally occurring questions from three diverse sources: human-to-search-engine queries, human-to-human interactions, and human-to-LLM conversations. Our comprehensive collection enables a rich understanding of human curiosity across various domains and contexts. Our analysis reveals a significant presence of causal questions (up to 42%) in the dataset, for which we develop an iterative prompt improvement framework to identify all causal queries and examine their unique linguistic properties, cognitive complexity and source distribution. We also lay the groundwork for exploring efficient identifiers of causal questions, providing six efficient classification models.
2023
mSCAN: A Dataset for Multilingual Compositional Generalisation Evaluation
Amélie Reymond | Shane Steinert-Threlkeld
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP
Language models achieve remarkable results on a variety of tasks, yet still struggle on compositional generalisation benchmarks. The majority of these benchmarks evaluate performance in English only, leaving us with the question of whether these results generalise to other languages. As an initial step to answering this question, we introduce mSCAN, a multilingual adaptation of the SCAN dataset. It was produced by a rule-based translation, developed in cooperation with native speakers. We then showcase this novel dataset in some in-context learning experiments with the multilingual large language model BLOOM as well as gpt3.5-turbo.