@inproceedings{wang-etal-2025-towards-novel,
    title = "Towards A ``Novel'' Benchmark: Evaluating Literary Fiction with Large Language Models",
    author = "Wang, Wenqing  and
      Gao, Mingqi  and
      Hu, Xinyu  and
      Wan, Xiaojun",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.findings-acl.1114/",
    doi = "10.18653/v1/2025.findings-acl.1114",
    pages = "21648--21673",
    ISBN = "979-8-89176-256-5",
    abstract = "Current exploration on creative generation focuses mainly on short stories, poetry, and scripts. With the expansion of Large Language Models (LLMs) context windows, ``novel'' avenues emerge. This study aims to extend the boundaries of Natural Language Generation (NLG) evaluation by exploring LLMs' capabilities in more challenging long-form fiction. We propose a new multi-level evaluation framework that incorporates ten metrics across the Macro, Meso, and Micro levels. An annotated fiction dataset, sourced from human authors, LLMs, and human-AI collaborations in both English and Chinese is then constructed. Human evaluation reveals notable disparities between LLM-generated and human-authored fictions, particularly the ``high-starting, low-ending'' pattern in LLM outputs. We further probe ten high-performing LLMs through different prompt templates, achieving moderate correlations by strategically utilizing diverse LLMs tailored to different levels, as an initial step towards better automatic fiction evaluation. Finally, we offer a fine-grained analysis of LLMs capabilities through six issues, providing promising insights for future advancements."
}