Clara Seyfried

2026

Investigating Reasoning with Hypotheses: The RIP2 Corpus
Ella Schad | Clara Seyfried | Chris Reed
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Analyses of hypothesis generation in fictionalised environments have significant potential for exploring factors influencing reasoning and decision-making in naturalistic contexts. Based on transcripts of 16 groups playing a murder mystery game, with a total of 42 human participants, RIP2 is a 177,000-word corpus exemplifying reasoning in the forensic domain. With an 80,000-word representative sample of the corpus annotated using an argumentation framework, RIP2 is nearly twice the size of the RIP Corpus of Collaborative Hypothesis-Making (RIP1), currently the only existing corpus of hypothesis-making in group environments. With a new experimental set-up and guidelines for annotating both cases of hypothesising and conjecturing, RIP2 offers insight into how participants generate, maintain, and reject hypotheses, as well as how they interact with others’ contributions. Based on its close exploration of six groups (three successful), this corpus particularly allows for group-level comparisons of factors influencing group success. Within this paper, we discuss the main contributions for understanding hypothesising and collaborative reasoning, and offer use cases for extended work demonstrating how analysis of hypothesis generation can be used for future research on argumentation quality and decision-making.

2025

Automating Alternative Generation in Decision-Making
Yevhen Kostiuk | Clara Seyfried | Chris Reed
Findings of the Association for Computational Linguistics: EMNLP 2025

In decision making, generating alternative solutions is crucial for solving a problem. However, cognitive biases can impede this process by constraining individual decision makers’ creativity. To address this issue, we introduce a new task for automatically generating alternatives, inspired by the process of human “brainstorming”. We define alternative options based on atomic action components and present a dataset of 106 annotated Reddit r/Advice posts containing unique alternative options extracted from users’ replies. We also introduce new metrics to assess the quality of generated components, including distinctiveness, creativity, upvote-weighted, crowd intersection, and final commit intersection scores. As a baseline, we evaluated the large language models (LLMs) LLaMa3:8b, LLaMa3.1:8b, and Gemma 2:9b on the alternative component generation task. On the one hand, models demonstrated high creativity (the ability to generate options beyond what Reddit users suggested) and performed well at proposing distinct alternatives. A subset of generated components was manually evaluated and found to be useful overall. This indicates that LLMs might be used to extend lists of alternative options, helping decision makers consider a problem from different perspectives. On the other hand, LLMs’ outputs often failed to align with human suggestions, implying that they still tend to miss important components.
