Yalin Wang
2026
DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning
Xiwen Chen | Wenhui Zhu | Peijie Qiu | Xuanzhao Dong | Hao Wang | Haiyu Wu | Huayu Li | Aris Sotiras | Yalin Wang | Abolfazl Razi
Findings of the Association for Computational Linguistics: ACL 2026
Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and a cost of $55, highlighting the critical role of diversity calibration in data-efficient alignment.
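The reward calibration described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes each sampled completion already has an embedding vector, uses summed pairwise cosine similarity as a stand-in for the semantic density term, and divides each reward by one plus that density before the usual group-relative normalization. The function names, the clipping of negative similarities, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_adjusted_rewards(rewards, embeddings):
    """Down-weight rewards of completions that resemble the rest of the group.

    Assumption: density_i = sum over j != i of max(0, cos(e_i, e_j)), and
    adjusted_i = reward_i / (1 + density_i). This is an illustrative
    inverse-density weighting, not necessarily the paper's exact formula.
    """
    n = len(rewards)
    adjusted = []
    for i in range(n):
        density = sum(
            max(0.0, cosine(embeddings[i], embeddings[j]))
            for j in range(n)
            if j != i
        )
        adjusted.append(rewards[i] / (1.0 + density))
    return adjusted


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: three correct completions (reward 1.0), two of them near-duplicates,
# plus one incorrect completion (reward 0.0). Embeddings are made up.
rewards = [1.0, 1.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

adjusted = diversity_adjusted_rewards(rewards, embeddings)
advantages = group_relative_advantages(adjusted)
```

In this toy group the two near-duplicate correct completions end up with a smaller adjusted reward than the correct completion that follows a different path, so the latter receives the larger advantage even though all three had the same scalar correctness reward.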
AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives
Yanxi Chen | Wenhui Zhu | Xiwen Chen | Zhipeng Wang | Xin Li | Peijie Qiu | Hao Wang | Xuanzhao Dong | Yujian Xiong | Anderson Schneider | Yuriy Nevmyvaka | Yalin Wang
Findings of the Association for Computational Linguistics: ACL 2026
Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g., generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set: our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming the latest SOTA methods.