Fengshuo Bai
2026
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
Jianghao Lin | Yuanyuan Shi | Xin Peng | Renjie Ding | Hairui Wang | Yuxuan Peng | Bizhe Bai | Weixi Song | Fengshuo Bai | Huacan Chai | Weinan Zhang | Fei Huang | Ying Wen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jianghao Lin | Yuanyuan Shi | Xin Peng | Renjie Ding | Hairui Wang | Yuxuan Peng | Bizhe Bai | Weixi Song | Fengshuo Bai | Huacan Chai | Weinan Zhang | Fei Huang | Ying Wen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at function calling, but inference scaling has been explored mainly for unstructured generation. We propose an inference-scaling framework for structured outputs that combines fine-grained beam search with ToolPRM, a process reward model scoring each intra-call decision (function name and argument filling). We build the first fine-grained intra-call supervision dataset via function masking, rollout collection, and step-level annotation. ToolPRM outperforms outcome and coarse-grained reward models in predictive accuracy and yields consistent test-time gains on multiple function-calling benchmarks. We further show that structured generation follows “explore more but retain less”, since early JSON errors are unrecoverable.
2025
AdaptFlow: Adaptive Workflow Optimization via Meta-Learning
Runchuan Zhu | Bowen Jiang | Lingrui Mei | Fangkai Yang | Lu Wang | Haoxiang Gao | Fengshuo Bai | Pu Zhao | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Runchuan Zhu | Bowen Jiang | Lingrui Mei | Fangkai Yang | Lu Wang | Haoxiang Gao | Fengshuo Bai | Pu Zhao | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in large language models (LLMs) have sparked growing interest in agentic workflows—structured sequences of LLM invocations designed to solve complex tasks. However, existing approaches often rely on static templates or manually designed workflows, which limit adaptability to diverse tasks and hinder scalability. We propose AdaptFlow, a natural language-based meta-learning framework inspired by model-agnostic meta-learning (MAML). AdaptFlow uses a bi-level optimization process: the inner loop performs task-specific adaptation via LLM-generated feedback, while the outer loop consolidates these refinements into a shared, generalizable initialization. Evaluated across question answering, code generation, and mathematical reasoning benchmarks, AdaptFlow consistently outperforms both manually crafted and automatically searched baselines, achieving state-of-the-art results with strong generalization across tasks and models.
GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
Runchuan Zhu | Zinco Jiang | Jiang Wu | Zhipeng Ma | Jiahe Song | Fengshuo Bai | Dahua Lin | Lijun Wu | Conghui He
Findings of the Association for Computational Linguistics: NAACL 2025
Runchuan Zhu | Zinco Jiang | Jiang Wu | Zhipeng Ma | Jiahe Song | Fengshuo Bai | Dahua Lin | Lijun Wu | Conghui He
Findings of the Association for Computational Linguistics: NAACL 2025
Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions that can be correctly answered are not rejected, thereby maintain the helpfulness of LLM outputs. In this paper, we address the two challenges by deriving insightful observations from the gradient-based perspective, and proposing the Gradient-driven Refusal Aware Instruction Tuning Framework GRAIT: (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving the balance between accurate refusals and maintaining useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in the overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT .