Yinmin Zhang
2026
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
Xiangfeng Wang | Hangyu Guo | Yanlin Lai | Mitt Huang | Liang Zhao | Chengyuan Yao | Yinmin Zhang | Qi Han | Xiaoxiaoren | Chun Yuan | Tong Xu | Zheng Ge | Xiangyu Zhang | Daxin Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms focus primarily on the consistency between the final result and the ground truth, often neglecting errors in the derivation process. As a result, correct answers produced from incorrect derivations receive positive rewards. To bridge this gap, we introduce **PRIME**, a benchmark for evaluating verifiers on **PR**ocess-outcome alignment verification **I**n **M**athematics and **E**ngineering. Curated from a comprehensive collection of college-level STEM problems, **PRIME** comprises 2,530 high-difficulty samples selected through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via **PRIME**. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of **8.29%**, **9.12%**, and **7.31%** on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation (R² > 0.92) between verifier accuracy on **PRIME** and RLVR training effectiveness, validating **PRIME** as a reliable predictor for verifier selection.
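The gap between outcome-only and process-aware verification described in this abstract can be illustrated with a minimal sketch. The abstract does not specify how the reward is computed, so everything here is an assumption for illustration: the `Verdict` type, both reward functions, and the rule that a positive reward requires both a correct answer and a sound derivation are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical verifier output for one sampled solution."""
    answer_correct: bool      # final result matches the ground truth
    derivation_sound: bool    # no flaw detected in the derivation steps


def outcome_only_reward(v: Verdict) -> float:
    # Outcome-centric verification: only the final answer is checked.
    return 1.0 if v.answer_correct else 0.0


def process_aware_reward(v: Verdict) -> float:
    # Assumed rule: reward only when the answer AND the derivation both pass.
    return 1.0 if (v.answer_correct and v.derivation_sound) else 0.0


# A correct answer reached through a flawed derivation.
lucky = Verdict(answer_correct=True, derivation_sound=False)
outcome_r = outcome_only_reward(lucky)
process_r = process_aware_reward(lucky)
```

Under these assumptions the two rewards differ only on the case the abstract highlights: a correct final answer backed by an incorrect derivation.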
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
Jingcheng Hu | Yinmin Zhang | Shijie Shang | Xiaobo Yang | Yue Peng | Zhewei Huang | Hebin Zhou | Xin Wu | Jie Cheng | Fanqi Wan | Xiangwen Kong | Chengyuan Yao | Kaiwen Yan | Ailin Huang | Hongyu Zhou | Qi Han | Zheng Ge | Xiangyu Zhang | Heung-Yeung Shum
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5’s 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
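The round structure described in this abstract (launch parallel trajectories, compact them into bounded messages, synthesize the messages for the next round) can be sketched as a plain control loop. This is a rough illustration only, not the released pipeline: `pacore_sketch`, `toy_reason`, `toy_compact`, and all of their parameters are hypothetical names, the toy functions stand in for model calls, and the trajectories run sequentially here rather than in parallel.

```python
from typing import Callable, List


def pacore_sketch(
    problem: str,
    reason: Callable[[str, List[str], int], str],
    compact: Callable[[str], str],
    rounds: int,
    width: int,
) -> str:
    """Hypothetical multi-round loop: explore, compact, then synthesize."""
    messages: List[str] = []  # context-bounded messages from the previous round
    for _ in range(rounds):
        # Launch `width` trajectories, each conditioned on the prior messages.
        trajectories = [reason(problem, messages, i) for i in range(width)]
        # Compact each trajectory so the next round stays within context.
        messages = [compact(t) for t in trajectories]
    # Final synthesis step: one more call over the last round's messages.
    return reason(problem, messages, 0)


def toy_reason(problem: str, messages: List[str], seed: int) -> str:
    # Stand-in for a model call; records how many messages it was shown.
    return f"{problem}|seen={len(messages)}|seed={seed}|" + "x" * 50


def toy_compact(trajectory: str, limit: int = 20) -> str:
    # Stand-in for compaction: keep only a bounded prefix.
    return trajectory[:limit]


answer = pacore_sketch("p", toy_reason, toy_compact, rounds=2, width=3)
```

The point of the sketch is that the context seen by any single call is bounded by `width` compacted messages, while the total amount of generated text grows with `rounds * width`.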
2025
Multi-matrix Factorization Attention
Jingcheng Hu | Houyi Li | Yinmin Zhang | Zili Wang | Shuigeng Zhou | Xiangyu Zhang | Heung-Yeung Shum
Findings of the Association for Computational Linguistics: ACL 2025
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including state-of-the-art methods such as MLA, fail to maintain strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value cache through value projection re-parameterization. MFA’s design enables strong model capacity under a tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
Co-authors
- Xiangyu Zhang 3
- Zheng Ge 2
- Qi Han 2
- Jingcheng Hu 2
- Heung-Yeung Shum 2
- Chengyuan Yao 2
- Jie Cheng 1
- Hangyu Guo 1
- Ailin Huang 1
- Mitt Huang 1
- Zhewei Huang 1
- Daxin Jiang 1
- Xiangwen Kong 1
- Yanlin Lai 1
- Houyi Li 1
- Yue Peng 1
- Shijie Shang 1
- Fanqi Wan 1
- Xiangfeng Wang 1
- Zili Wang 1
- Xin Wu 1
- Xiaoxiaoren 1
- Tong Xu 1
- Kaiwen Yan 1
- Xiaobo Yang 1
- Chun Yuan 1
- Liang Zhao (赵亮) 1
- Hebin Zhou 1
- Hongyu Zhou 1
- Shuigeng Zhou 1