2025
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning
Peiqi Sui | Juan Diego Rodriguez | Philippe Laban | J. Dean Murphy | Joseph P. Dexter | Richard Jean So | Samuel Baker | Pramit Chaudhuri
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, where they gather textual details from which to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been used to evaluate large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results show that while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% to 69.7%), their performance still trails that of experienced human evaluators on 10 out of our 11 tasks.
2024
Confabulation: The Surprising Value of Large Language Model Hallucinations
Peiqi Sui | Eamon Duede | Sophie Wu | Richard So
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper presents a systematic defense of large language model (LLM) hallucinations or ‘confabulations’ as a potential resource instead of a categorically negative pitfall. The standard view is that confabulations are inherently problematic and AI research should eliminate this flaw. In this paper, we argue and empirically demonstrate that measurable semantic characteristics of LLM confabulations mirror a human propensity to utilize increased narrativity as a cognitive resource for sense-making and communication. In other words, confabulation has potential value. Specifically, we analyze popular hallucination benchmarks and reveal that hallucinated outputs display increased levels of narrativity and semantic coherence relative to veridical outputs. This finding reveals a tension in our usually dismissive understandings of confabulation. It suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation.
2023
Storyline-Centric Detection of Aphasia and Dysarthria in Stroke Patient Transcripts
Peiqi Sui | Kelvin Wong | Xiaohui Yu | John Volpi | Stephen Wong
Proceedings of the 5th Clinical Natural Language Processing Workshop
Aphasia and dysarthria are both common symptoms of stroke, affecting around 30% and 50% of acute ischemic stroke patients, respectively. In this paper, we propose a storyline-centric approach to detect aphasia and dysarthria in acute stroke patients using transcribed picture descriptions alone. Our pipeline enriches the training set with healthy data to address the lack of acute stroke patient data and utilizes knowledge distillation to significantly improve upon a document classification baseline, achieving an AUC of 0.814 (aphasia) and 0.764 (dysarthria) on a patient-only validation set.
Mrs. Dalloway Said She Would Segment the Chapters Herself
Peiqi Sui | Lin Wang | Sil Hamilton | Thorsten Ries | Kelvin Wong | Stephen Wong
Proceedings of the 5th Workshop on Narrative Understanding
This paper proposes a sentiment-centric pipeline to perform unsupervised plot extraction on non-linear novels like Virginia Woolf’s Mrs. Dalloway, a novel widely considered to be “plotless.” Combining transformer-based sentiment analysis models with statistical testing, we model sentiment’s rate-of-change and correspondingly segment the novel into emotionally self-contained units qualitatively evaluated to be meaningful surrogate pseudo-chapters. We validate our findings by evaluating our pipeline as a fully unsupervised text segmentation model, achieving an F1 score of 0.643 (regional) and 0.214 (exact) in chapter break prediction on a validation set of linear novels with existing chapter structures. In addition, we observe notable differences between the distributions of predicted chapter lengths in linear and non-linear fictional narratives, with the latter exhibiting significantly greater variability. Our results hold significance for narrative researchers appraising methods for extracting plots from non-linear novels.