Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

Raffaele Pisano; Roberto Navigli

Abstract

Process Reward Models (PRMs) have emerged as a powerful tool for providing step-level feedback when evaluating the reasoning of Large Language Models (LLMs), which frequently produce chains of thought (CoTs) containing errors even when the final answer is correct. However, existing PRM datasets remain expensive to construct, prone to annotation errors, and predominantly limited to the mathematical domain. This work introduces a novel and scalable approach to PRM dataset generation based on planning logical problems expressed in the Planning Domain Definition Language (PDDL). Using this method, we generate a corpus of approximately one million reasoning steps across various PDDL domains and use it to train PRMs. Experimental results show that augmenting widely-used PRM training datasets with PDDL-derived data yields substantial improvements in both mathematical and non-mathematical reasoning, as demonstrated across multiple benchmarks. These findings indicate that planning problems constitute a scalable and effective resource for generating robust, precise, and fine-grained training data for PRMs, going beyond the classical mathematical sources that dominate this field.

Anthology ID: 2026.acl-long.1292
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 28022–28042
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1292/
Cite (ACL): Raffaele Pisano and Roberto Navigli. 2026. Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28022–28042, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards (Pisano & Navigli, ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1292.pdf
Checklist: 2026.acl-long.1292.checklist.pdf