Check Your Work: Structured Checklist Feedback for Improving Large Language Models
Jonathan Cook, Tim Rocktäschel, Jakob Nicolaus Foerster, Dennis Aumiller, Alex Wang
Abstract
Much recent progress in Large Language Model (LLM) performance has been driven by verifiable feedback in deterministic domains like mathematics and code. However, scaling reinforcement learning (RL) and test-time compute in domains for which strict verification is infeasible remains a challenge. A common approach is to use an LLM-as-judge, which often relies on opaque, monolithic scores. In this work, we propose that AI feedback is most effective when decomposed into granular, prompt-specific checklists. To transform these checklists into a scalar reward, we introduce DIVA: DIscriminative VAriance weighting, a dynamic aggregation scheme that prioritises checklist items based on their ability to distinguish quality across a candidate pool. This ensures the reward signal focuses on the most salient criteria for a given prompt and response group, rather than being diluted by trivial or redundant constraints. Our approach yields an 11.8% win-rate improvement on AlpacaEval 2.0 using Qwen3-8B, outperforming holistic reward models and existing checklist baselines. Beyond training, we show that these checklists serve as a structured policy improvement operator at inference time: by using the model's own checklist evaluation as localised contextual feedback, the model can iteratively refine its output. This self-correction mechanism outperforms free-form sequential self-correction, offering a unified and interpretable framework for scaling both training-time and test-time performance in domains lacking strict verifiers.
- Anthology ID:
- 2026.acl-long.759
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 16649–16688
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.759/
- Cite (ACL):
- Jonathan Cook, Tim Rocktäschel, Jakob Nicolaus Foerster, Dennis Aumiller, and Alex Wang. 2026. Check Your Work: Structured Checklist Feedback for Improving Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16649–16688, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Check Your Work: Structured Checklist Feedback for Improving Large Language Models (Cook et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.759.pdf
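
Illustrative sketch (not from the paper): the abstract describes DIVA as weighting checklist items by how well they distinguish quality across a candidate pool, but it does not state the formula. The Python below is one plausible reading, assuming each item's weight is its variance across the pool, normalised to sum to 1, with a uniform fallback when no item varies. The function name `variance_weighted_rewards`, the 0/1 scoring, and the fallback rule are all assumptions made for illustration; consult the PDF for the authors' actual definition.

```python
from statistics import pvariance


def variance_weighted_rewards(scores):
    """Aggregate per-item checklist scores into one scalar reward per candidate.

    scores[i][j] is the score of candidate i on checklist item j
    (assumed here to be 0.0 or 1.0; hypothetical convention).
    Returns (weights, rewards).
    """
    n_items = len(scores[0])
    # Population variance of each checklist item across the candidate pool.
    variances = [pvariance([cand[j] for cand in scores]) for j in range(n_items)]
    total = sum(variances)
    if total == 0:
        # Assumed fallback: no item separates the candidates, so weight uniformly.
        weights = [1.0 / n_items] * n_items
    else:
        # Items that every candidate passes (or fails) get zero weight.
        weights = [v / total for v in variances]
    rewards = [sum(w * s for w, s in zip(weights, cand)) for cand in scores]
    return weights, rewards


# Four candidate responses scored on three checklist items.
# Item 0 is satisfied by every candidate, so it cannot discriminate.
pool = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]
weights, rewards = variance_weighted_rewards(pool)
print(weights)
print(rewards)
```

In this toy pool the first item contributes nothing to the reward because all candidates satisfy it, which matches the abstract's point that trivial or redundant constraints should not dilute the signal.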