Scaling Evaluation-Time Compute with Reasoning Models as Evaluators

Seungone Kim; Ian Wu; Jinu Lee; Xiang Yue; Seongyun Lee; Minkyeong Moon; Carolin Lawrence; Kiril Gashteovski; Julia Hockenmaier; Graham Neubig; Sean Welleck

Abstract

Language model (LM) evaluators that generate chain-of-thought (CoT) reasoning are widely used for the assessment of LM responses. Simultaneously, increasing LMs’ "thinking" time through scaling test-time compute has proven to be an effective technique for solving challenging problems in domains such as math and code. This raises a natural question: can an LM’s evaluation capability also be improved by scaling test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long CoT reasoning) as evaluators. We explore scaling evaluation-time compute by using reasoning models to evaluate both the overall candidate response (i.e., outcome evaluation) and the individual reasoning steps within it (i.e., process evaluation). We observe that evaluator performance improves monotonically with the number of reasoning tokens generated, mirroring trends seen in LM reasoning. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as increasing compute during generation for improving an LM’s problem-solving performance.

Anthology ID: 2026.findings-acl.2102
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 42354–42384
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2102/
Cite (ACL): Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Minkyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, and Sean Welleck. 2026. Scaling Evaluation-Time Compute with Reasoning Models as Evaluators. In Findings of the Association for Computational Linguistics: ACL 2026, pages 42354–42384, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Scaling Evaluation-Time Compute with Reasoning Models as Evaluators (Kim et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2102.pdf
Checklist: 2026.findings-acl.2102.checklist.pdf
