Qianxi He

2025

Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations. LLMs often rely on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs’ reasoning performance and adaptability to diverse logical structures. We release our codes and augmented data in https://anonymous.4open.science/r/Order-Centric-Data-Augmentation-822C.

pdf bib abs
Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
Qianxi He | Qingyu Ren | Shanzhe Lei | Xuhong Wang | Yingchun Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, most current methods rely on rule-based evaluations of answer correctness, overlooking the importance of confidence-aware reasoning, especially for small to medium-sized models. These models often receive rewards for speculative answers without generating coherent reasoning chains. To address this limitation, we propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our codes and model in https://github.com/qianxiHe147/C2RM.

2024

pdf bib abs
From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models
Qianyu He | Jie Zeng | Qianxi He | Jiaqing Liang | Yanghua Xiao
Findings of the Association for Computational Linguistics: EMNLP 2024

It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found that training LLMs with instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels. Additionally, we further propose methods addressing how to obtain and utilize the effective training data. Finally, we conduct extensive experiments to prove the effectiveness of our methods in terms of overall performance and training efficiency. We also demonstrate that our methods improve models’ ability to follow instructions generally and generalize effectively across out-of-domain, in domain, and adversarial settings, while maintaining general capabilities.

Co-authors

Fei Yu 1

Venues

emnlp2
findings1

Fix author