Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers

Melanie Subbiah, Sean Zhang, Lydia B. Chilton, Kathleen McKeown


Abstract
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and are therefore unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and Llama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and with interpreting difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the writers' quality ratings.
Anthology ID:
2024.tacl-1.71
Volume:
Transactions of the Association for Computational Linguistics, Volume 12
Year:
2024
Address:
Cambridge, MA
Venue:
TACL
Publisher:
MIT Press
Pages:
1290–1310
URL:
https://preview.aclanthology.org/fix-sig-urls/2024.tacl-1.71/
DOI:
10.1162/tacl_a_00702
Cite (ACL):
Melanie Subbiah, Sean Zhang, Lydia B. Chilton, and Kathleen McKeown. 2024. Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers. Transactions of the Association for Computational Linguistics, 12:1290–1310.
Cite (Informal):
Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers (Subbiah et al., TACL 2024)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2024.tacl-1.71.pdf