PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Xiangfeng Wang; Hangyu Guo; Yanlin Lai; Mitt Huang; Liang Zhao (赵亮); Chengyuan Yao; Yinmin Zhang; Qi Han; Xiaoxiaoren; Chun Yuan; Tong Xu; Zheng Ge; Xiangyu Zhang; Daxin Jiang

PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering

Xiangfeng Wang, Hangyu Guo, Yanlin Lai, Mitt Huang, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Xiaoxiaoren, Chun Yuan, Tong Xu, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Abstract

While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce **PRIME**, a benchmark for evaluating verifiers on **PR**ocess-outcome alignment verification **I**n **M**athematics and **E**ngineering. Curated from a comprehensive collection of college-level STEM problems, **PRIME** comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via **PRIME**. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of **8.29%**, **9.12%**, and **7.31%** on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation (R² > 0.92) between verifier accuracy on **PRIME** and RLVR training effectiveness, validating **PRIME** as a reliable predictor for verifier selection.

Anthology ID:: 2026.acl-long.683
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 14973–14985
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.683/
DOI:
Bibkey:
Cite (ACL):: Xiangfeng Wang, Hangyu Guo, Yanlin Lai, Mitt Huang, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Xiaoxiaoren, Chun Yuan, Tong Xu, Zheng Ge, Xiangyu Zhang, and Daxin Jiang. 2026. PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14973–14985, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering (Wang et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.683.pdf
Checklist:: 2026.acl-long.683.checklist.pdf

PDF Cite Search Checklist Fix data