Yingwei MA

Also published as: Yingwei Ma


2026

Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, due to its essentially on-policy strategy coupled with LLM’s immense action space and sparse reward. Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose R-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. R-PLUS integrates two core components, i.e., Multiple Importance Sampling to address distributional mismatch from external data, and Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. Compared with existing RLVR methods, R-PLUS achieves 1) state-of-the-art performance on six math reasoning benchmarks; 2) superior performance on six out-of-distribution reasoning tasks; 3) consistent and significant gains across diverse model families, with average relative improvements up to 69.2%. Moreover, the analysis of Pass@k curves indicates that R-PLUS effectively resolves the capability boundary collapse problem.

2025

The evaluation of mathematical reasoning capabilities constitutes a critical pathway toward achieving Artificial General Intelligence (AGI). Prevailing benchmarks including MATH and AIME mainly feature single-instantiation problems with fixed numbers, permitting pattern matching instead of principled deductive reasoning and leaving generalization on isomorphic problem variants untested. To address these limitations, we propose the UTMath Benchmark, employing rigorous unit testing methodology that simultaneously quantifies solution accuracy and solution space generality. It comprises 1,053 problems spanning 9 mathematical domains, each accompanied by an average of 68 varied test cases. With answer possibilities per problem on average, UTMath sets new standards for robust reasoning while preventing memorization. UTMath is highly challenging, with the best-performing model, o1-mini, solving only 32.57% of the problems, followed by o1-preview at 27.16%, and GPT-4o at 26.93%. We further propose Reasoning-to-Code Thoughts (RCoT), a prompting strategy that decouples symbolic reasoning from code synthesis. RCoT guides LLMs to first derive formal reasoning structures before generating executable code, producing generalizable solutions rather than situation-specific answers. To help the community push mathematical reasoning further, we release UTMath-Train (70k samples), a companion training set generated under the same protocol. Our benchmark can be accessed via the following link: [UTMath](https://utmathhomepage.github.io/)
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents the first comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies specific to these powerful reasoning-enhanced models. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.