NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Wenqing Wu; Yi Zhao; Yuzhuo Wang; Siyou Li; Juexi Shao; Yunfei Long; Chengzhi Zhang

Abstract

Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promising results in generating review comments, the absence of dedicated benchmarks has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs’ capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper–review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. Our findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.

Anthology ID: 2026.findings-acl.1607
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 32103–32133
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1607/
Cite (ACL): Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao, Yunfei Long, and Chengzhi Zhang. 2026. NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 32103–32133, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment (Wu et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1607.pdf
Checklist: 2026.findings-acl.1607.checklist.pdf
