Hao Yi


2025

SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin
Hao Yi | Qingyang Li | Yulan Hu | Fuzheng Zhang | Di Zhang | Yong Liu
Findings of the Association for Computational Linguistics: EMNLP 2025

Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-phase techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose Self-training with Process Preference learning using Dynamic value margin (SPPD). SPPD formulates reasoning as a process-based Markov Decision Process (MDP), leveraging the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, eliminating the need for distillation. We theoretically establish that SPPD is equivalent to on-policy policy gradient methods under constrained reward functions. Experimental results on 7B-scale models show consistent superiority across both in-domain and out-of-domain mathematical benchmarks.
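The abstract describes a DPO-style, step-level preference objective whose margin is derived from value estimates via the Bellman optimality equation. Purely as an illustration of that general idea, and not of the SPPD objective itself, the sketch below shows a step-level preference loss with an additive margin; the function name, the `value_gap` input, and the margin form `beta * value_gap` are assumptions made for this example.

```python
# Illustrative sketch only: a DPO-style step-level preference loss with an
# additive "dynamic" margin term. This is NOT the SPPD objective from the
# paper; all names and the margin form here are assumptions chosen to make
# margin-based step-level preference optimization concrete.
import math

def step_preference_loss(
    logp_chosen: float,       # policy log-prob of the preferred reasoning step
    logp_rejected: float,     # policy log-prob of the dispreferred step
    ref_logp_chosen: float,   # reference-model log-prob of the preferred step
    ref_logp_rejected: float, # reference-model log-prob of the dispreferred step
    value_gap: float,         # assumed estimate of how much better the chosen step is
    beta: float = 0.1,
) -> float:
    """Negative log-sigmoid preference loss with a margin that grows with value_gap."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * value_gap  # "dynamic" margin: larger value gaps demand larger separation
    logits = beta * (chosen_ratio - rejected_ratio) - margin
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Example: a step pair where the chosen step is noticeably more likely under the policy.
print(step_preference_loss(-1.2, -2.5, -1.4, -2.3, value_gap=0.8))
```

In this toy form, a larger `value_gap` raises the margin, so the optimizer must separate the chosen and rejected steps more strongly before the loss saturates; the paper's actual margin is derived from its process-based MDP formulation.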