Tim Rocktäschel

2026

Check Your Work: Structured Checklist Feedback for Improving Large Language Models
Jonathan Cook | Tim Rocktäschel | Jakob Nicolaus Foerster | Dennis Aumiller | Alex Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Much recent progress in Large Language Model (LLM) performance has been driven by verifiable feedback in deterministic domains like mathematics and code. However, scaling reinforcement learning (RL) and test-time compute in domains for which strict verification is infeasible remains a challenge. A common approach is to use an LLM-as-judge, which often relies on opaque, monolithic scores. In this work, we propose that AI feedback is most effective when decomposed into granular, prompt-specific checklists. To transform these checklists into a scalar reward, we introduce DIVA: DIscriminative VAriance weighting, a dynamic aggregation scheme that prioritises checklist items based on their ability to distinguish quality across a candidate pool. This ensures the reward signal focuses on the most salient criteria for a given prompt and response group, rather than being diluted by trivial or redundant constraints. Our approach yields an 11.8% win-rate improvement on AlpacaEval 2.0 using Qwen3-8B, outperforming holistic reward models and existing checklist baselines. Beyond training, we show that these checklists serve as a structured policy improvement operator at inference time: by using the model’s own checklist evaluation as localised contextual feedback, the model can iteratively refine its output. This self-correction mechanism outperforms free-form sequential self-correction, offering a unified and interpretable framework for scaling both training-time and test-time performance in domains lacking strict verifiers.
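
The abstract describes DIVA only at a high level, so the following is a minimal illustrative sketch of the general idea of variance-based checklist weighting, not the paper's actual formula. It assumes each checklist item receives a numeric score per candidate, that an item's weight is proportional to the variance of its scores across the candidate pool, and that a pool with no variance falls back to uniform weights. The function name `variance_weighted_rewards` and the example pool are hypothetical.

```python
from statistics import pvariance


def variance_weighted_rewards(scores):
    """Aggregate checklist scores into one scalar reward per candidate.

    scores[i][j] is the judged score (e.g. 0 or 1) of candidate i on
    checklist item j. The weighting rule here is an assumption made for
    illustration, not the published DIVA definition.
    """
    n_items = len(scores[0])
    # Variance of each checklist item across the candidate pool:
    # items every candidate passes (or fails) get zero variance.
    variances = [pvariance([row[j] for row in scores]) for j in range(n_items)]
    total = sum(variances)
    if total == 0:
        # No item separates the candidates; use uniform weights (assumption).
        weights = [1.0 / n_items] * n_items
    else:
        weights = [v / total for v in variances]
    # Reward is the weighted sum of a candidate's checklist scores.
    rewards = [sum(w * s for w, s in zip(weights, row)) for row in scores]
    return weights, rewards


# Hypothetical pool: 4 candidates scored on 3 checklist items.
pool = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]
weights, rewards = variance_weighted_rewards(pool)
print(weights)
print(rewards)
```

In this toy pool every candidate satisfies the first item, so it receives zero weight and the reward is determined entirely by the two items that actually distinguish the candidates.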

