LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein; Sebastian Russo; Violet Xiang; Kabir Jolly; Rafael Rafailov; Nick Haber

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber

Abstract

Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. To address this gap, we introduce LitBench, a large-scale benchmark for creative writing evaluation, featuring a training corpus of 43,827 story pairs and a 2,480-pair test set curated from Reddit. Using LitBench, we benchmark existing LLM judges and train specialized reward models. Our analysis reveals that the strongest OTS judge, Claude-3.7-Sonnet, achieves only 73% agreement with human preferences. In contrast, our trained Bradley-Terry and generative reward models both reach 78% accuracy, outperforming all OTS judges. An online human study further validates our models, showing their rankings of newly generated stories align more closely with human preferences. Our work provides the first reliable benchmark and specialized reward models for creative writing, establishing a crucial foundation for the future development of more capable verifiers.

Anthology ID:: 2026.eacl-long.362
Volume:: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: March
Year:: 2026
Address:: Rabat, Morocco
Editors:: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:: EACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 7740–7755
Language:
URL:: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.362/
DOI:
Bibkey:
Cite (ACL):: Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, and Nick Haber. 2026. LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7740–7755, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):: LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing (Fein et al., EACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.362.pdf

PDF Cite Search Fix data