Fei Huang
2026
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Binghai Wang | Yantao Liu | Yuxuan Liu | Tianyi Tang | Shenzhi Wang | Chang Gao | Chujie Zheng | Yichang Zhang | Le Yu | Shixuan Liu | Tao Gui | Qi Zhang | Xuanjing Huang | Bowen Yu | Fei Huang | Junyang Lin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. When the resulting RM is used during RLHF, our method improves performance on Arena Hard v2, notably yielding a 7% improvement on creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
ToolRM: Towards Agentic Tool-Use Reward Modeling
Renhao Li | Jianhong Tu | Yang Su | Yantao Liu | Fei Huang | Hamid Alinejad-Rokny | Derek F. Wong | Junyang Lin | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBenchBFCL, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool-calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond training objectives, generative ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release data to facilitate future research.