Jingcheng Hu
2026
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
Jingcheng Hu | Yinmin Zhang | Shijie Shang | Xiaobo Yang | Yue Peng | Zhewei Huang | Hebin Zhou | Xin Wu | Jie Cheng | Fanqi Wan | Xiangwen Kong | Chengyuan Yao | Kaiwen Yan | Ailin Huang | Hongyu Zhou | Qi Han | Zheng Ge | Xiangyu Zhang | Heung-Yeung Shum
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5’s 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
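The round structure described in the abstract (launch parallel trajectories, compact them into bounded messages, synthesize) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the released pipeline: `pacore_answer`, `reason`, `compact`, `num_rounds`, and `width` are illustrative names, the trajectories run sequentially rather than in parallel, and producing the final answer with one last synthesis call is an assumption.

```python
from typing import Callable, List


def pacore_answer(
    problem: str,
    reason: Callable[[str, List[str]], str],  # hypothetical: one model rollout
    compact: Callable[[str], str],            # hypothetical: shrink a trajectory to a short message
    num_rounds: int = 2,
    width: int = 4,
) -> str:
    """Toy multi-round loop: explore, compact, then synthesize."""
    messages: List[str] = []
    for _ in range(num_rounds):
        # Each trajectory sees the problem plus the previous round's messages.
        # A real system would launch these rollouts in parallel.
        trajectories = [reason(problem, messages) for _ in range(width)]
        # Compaction keeps the next round's input bounded, so the context
        # window is not exceeded however long each trajectory was.
        messages = [compact(t) for t in trajectories]
    # Assumption: the final answer comes from one synthesis rollout.
    return reason(problem, messages)


# Stand-ins for a language model, used only to exercise the control flow.
def toy_reason(problem: str, messages: List[str]) -> str:
    return f"thoughts({len(messages)} msgs) => answer: 42"


def toy_compact(trajectory: str) -> str:
    return trajectory.split("=> ")[-1]


result = pacore_answer("What is 6 * 7?", toy_reason, toy_compact)
```

In this sketch the total tokens generated grow with `num_rounds * width`, while the input to any single rollout is limited to the problem plus `width` compacted messages.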
2025
Multi-matrix Factorization Attention
Jingcheng Hu | Houyi Li | Yinmin Zhang | Zili Wang | Shuigeng Zhou | Xiangyu Zhang | Heung-Yeung Shum
Findings of the Association for Computational Linguistics: ACL 2025
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as the value cache through value projection re-parameterization. MFA’s design enables strong model capacity under a tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
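To make concrete why low-rank factorization makes more and wider heads affordable, the sketch below compares parameter counts for a dense projection and a rank-`rank` factorized one. This is a generic low-rank illustration, not the paper's exact layer layout: the function names and all dimensions are hypothetical values chosen for the arithmetic.

```python
def dense_params(d_model: int, n_heads: int, head_dim: int) -> int:
    """Parameters in a single dense projection d_model -> n_heads * head_dim."""
    return d_model * n_heads * head_dim


def low_rank_params(d_model: int, n_heads: int, head_dim: int, rank: int) -> int:
    """Parameters when the same projection is factorized as
    (d_model -> rank) followed by (rank -> n_heads * head_dim)."""
    return d_model * rank + rank * n_heads * head_dim


# Hypothetical sizes, chosen only for illustration.
dense = dense_params(d_model=2048, n_heads=32, head_dim=256)
factored = low_rank_params(d_model=2048, n_heads=32, head_dim=256, rank=512)
```

With these illustrative sizes the factorized projection needs 5,242,880 parameters against 16,777,216 for the dense one, which is the kind of saving that leaves room to grow the head count and head dimension.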