Check Your Work: Structured Checklist Feedback for Improving Large Language Models

Jonathan Cook, Tim Rocktäschel, Jakob Nicolaus Foerster, Dennis Aumiller, Alex Wang

Abstract

Much recent progress in Large Language Model (LLM) performance has been driven by verifiable feedback in deterministic domains like mathematics and code. However, scaling reinforcement learning (RL) and test-time compute in domains for which strict verification is infeasible remains a challenge. A common approach is to use an LLM-as-judge, which often relies on opaque, monolithic scores. In this work, we propose that AI feedback is most effective when decomposed into granular, prompt-specific checklists. To transform these checklists into a scalar reward, we introduce DIVA: DIscriminative VAriance weighting, a dynamic aggregation scheme that prioritises checklist items based on their ability to distinguish quality across a candidate pool. This ensures the reward signal focuses on the most salient criteria for a given prompt and response group, rather than being diluted by trivial or redundant constraints. Our approach yields an 11.8% win-rate improvement on AlpacaEval 2.0 using Qwen3-8B, outperforming holistic reward models and existing checklist baselines. Beyond training, we show that these checklists serve as a structured policy improvement operator at inference time: by using the model’s own checklist evaluation as localised contextual feedback, the model can iteratively refine its output. This self-correction mechanism outperforms free-form sequential self-correction, offering a unified and interpretable framework for scaling both training-time and test-time performance in domains lacking strict verifiers.
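
To make the aggregation idea concrete, the following is a minimal sketch of one plausible reading of variance-based weighting. The abstract does not give the DIVA formula, so everything here is an assumption for illustration: the function name `variance_weighted_rewards`, the use of binary checklist scores, weights proportional to each item's variance across the candidate pool, and the uniform fallback when no item separates the candidates.

```python
from statistics import pvariance


def variance_weighted_rewards(scores):
    """Aggregate checklist scores into one scalar reward per candidate.

    scores[i][j] is the score of candidate i on checklist item j
    (assumed here to be 0 or 1). Hypothetical helper, not the paper's code.
    """
    n_items = len(scores[0])
    # Variance of each checklist item across the candidate pool:
    # items every candidate passes (or fails) have zero variance.
    variances = [pvariance([row[j] for row in scores]) for j in range(n_items)]
    total = sum(variances)
    if total == 0:
        # Assumption: if no item distinguishes the candidates, weight uniformly.
        weights = [1.0 / n_items] * n_items
    else:
        weights = [v / total for v in variances]
    rewards = [sum(w * s for w, s in zip(weights, row)) for row in scores]
    return weights, rewards


# Four candidate responses scored against three checklist items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
weights, rewards = variance_weighted_rewards(scores)
print(weights)
print(rewards)
```

In this toy pool every candidate satisfies the first item, so it receives zero weight and the reward depends only on the two items that actually separate the candidates, which is the "not diluted by trivial constraints" behaviour the abstract describes.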

Anthology ID: 2026.acl-long.759
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 16649–16688
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.759/
Cite (ACL): Jonathan Cook, Tim Rocktäschel, Jakob Nicolaus Foerster, Dennis Aumiller, and Alex Wang. 2026. Check Your Work: Structured Checklist Feedback for Improving Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16649–16688, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Check Your Work: Structured Checklist Feedback for Improving Large Language Models (Cook et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.759.pdf
Checklist: 2026.acl-long.759.checklist.pdf
