CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang


Abstract
Evaluating creative text such as human-written stories with language models has always been challenging, owing to the subjectivity of multi-annotator ratings. To mimic the human thinking process, chain-of-thought (CoT; Wei et al., 2023) prompting generates free-text explanations that guide a model's predictions, and self-consistency (SC; Wang et al., 2022) marginalizes predictions over multiple generated explanations. In this study, we find that these widely used self-consistency reasoning methods yield suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually producing a good rating prediction for a given aspect of a story. To overcome this challenge, we propose Chain-of-Keywords (CoKe), which generates a sequence of keywords before the free-text rationale; these keywords guide the rating prediction of our evaluation language model. We then sample a diverse set of such keyword sequences and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe built on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4, with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
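
A minimal sketch of the aggregation step described in the abstract, assuming hypothetical callables for keyword generation, rationale generation, and rating prediction (the names, signatures, and the simple mean are illustrative assumptions, not the paper's exact procedure):

    # Sketch: sample several keyword-chain rationales, rate each, and aggregate.
    from statistics import mean
    from typing import Callable, List

    def coke_score(
        story: str,
        aspect: str,
        gen_keywords: Callable[[str, str], List[str]],       # (story, aspect) -> keyword chain
        gen_rationale: Callable[[str, str, List[str]], str],  # (story, aspect, keywords) -> rationale
        rate: Callable[[str, str, List[str], str], float],    # (..., rationale) -> aspect rating
        num_samples: int = 5,
    ) -> float:
        """Aggregate ratings over a diverse set of keyword-chain generations."""
        scores = []
        for _ in range(num_samples):
            keywords = gen_keywords(story, aspect)            # diversity assumed to come from sampling
            rationale = gen_rationale(story, aspect, keywords)
            scores.append(rate(story, aspect, keywords, rationale))
        return mean(scores)                                   # simple mean; the paper may aggregate differently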
Anthology ID:
2025.gem-1.31
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
366–384
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.31/
Cite (ACL):
Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, and Haw-Shiuan Chang. 2025. CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 366–384, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization (Joshi et al., GEM 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.31.pdf