Guangyao Shen

2026

CURE: Critique-Driven Unified Reinforcement Learning for Test-Time Self-Improvement
Guirong Chen | Shuqi Ye | Wenkai Yang | Shiqi Shen | Guangyao Shen | Yankai Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The evolution paradigm of Large Language Models (LLMs) is shifting from scaling training compute to scaling inference-time compute. While Reinforcement Learning with Verifiable Rewards (RLVR) has become a key engine for this transition, standard approaches often fail to equip models with the autonomous improvement capabilities required for test-time scaling. Existing critique-guided methods attempt to mitigate this by leveraging external feedback or ground-truth signals; however, these dependencies are unavailable at test time, fundamentally limiting the model’s capacity for continuous self-improvement. To bridge this gap, we propose CURE (Critique-driven Unified REinforcement Learning), a framework that jointly optimizes a single policy for standard solving, critiquing, and guided re-exploration. Uniquely, CURE facilitates re-exploration by generating strategic hints while discarding initial incorrect solutions to mitigate anchoring bias. Empirical results across diverse mathematical reasoning and code generation benchmarks demonstrate that CURE not only maintains competitive single-turn performance but, more importantly, unlocks effective inference-time scaling, enabling the model to significantly boost accuracy through iterative self-improvement.

2024

Recently, tool use with LLMs has become one of the primary research topics, as it can help LLMs generate truthful and helpful responses. Existing studies on tool use with LLMs primarily focus on enhancing the tool-calling ability of LLMs. In practice, like chat assistants, LLMs are also required to align with human values in the context of tool use. Specifically, LLMs should refuse to answer unsafe tool-use instructions and insecure tool responses to ensure their reliability and harmlessness. At the same time, LLMs should demonstrate autonomy in tool use to reduce the costs associated with tool calling. To tackle this issue, we first introduce a principle that LLMs should follow in tool-use scenarios: H2A. The goal of H2A is to align LLMs with **helpfulness**, **harmlessness**, and **autonomy**. In addition, we propose ToolAlign, a dataset comprising instruction-tuning data and preference data to align LLMs with the H2A principle for tool use. Based on ToolAlign, we develop LLMs by supervised fine-tuning and preference learning, and experimental results demonstrate that these LLMs exhibit remarkable tool-calling capabilities while also refusing to engage with harmful content and displaying a high degree of autonomy in tool utilization. The code and datasets are available at: https://github.com/zhiyuanc2001/ToolAlign.

Co-authors

Wenkai Yang 1

Shuqi Ye 1

Gong Zhi 1

Venues

ACL 1
EMNLP 1
