Can Xu
2026
RubricBench: Aligning Model-Generated Rubrics with Human Standards
Junyi Zhou | Qiyuan Zhang | Yufei Wang | Fuyuan Lyu | Yidong Ming | Can Xu | Qingfeng Sun | Kai Zheng | Peng Kang | Xue Liu | Chen Ma
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
Qiyuan Zhang | Yufei Wang | Tianhe Wu | Can Xu | Qingfeng Sun | Kai Zheng | Xue Liu | Chen Ma
Findings of the Association for Computational Linguistics: ACL 2026
Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (multi-dimensional principle coverage) and Depth-CoT (substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured Breadth-CoT and Depth-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: Breadth-CoT benefits subjective preference tasks, whereas Depth-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands.
2025
WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models
Huawen Feng | Pu Zhao | Qingfeng Sun | Can Xu | Fangkai Yang | Lu Wang | Qianli Ma | Qingwei Lin | Saravan Rajmohan | Dongmei Zhang | Qi Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite recent progress achieved by code large language models (LLMs), their remarkable abilities are largely dependent on fine-tuning on high-quality data, posing challenges for data collection and annotation. To address this, current methods often design various data flywheels to collect complex code instructions, enabling models to handle more intricate tasks. However, these approaches typically rely on off-the-shelf datasets and data augmentation from a limited set of proprietary LLMs (e.g., Claude and GPT4), which restricts the diversity of the constructed data and makes it prone to systemic biases. In this paper, we propose **WarriorCoder**, a novel paradigm that learns from expert battles to address these limitations. Specifically, we create an arena where leading expert code LLMs challenge each other, with evaluations conducted by impartial judges. This competitive framework generates novel training data from scratch, leveraging the strengths of all participants. Experimental results show that **WarriorCoder** achieves state-of-the-art performance compared to previous models of the same size, even without relying on proprietary LLMs.