Meng Zhao
2025
Let’s Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models
Kangyang Luo | Zichen Ding | Zhenmin Weng | Lingfeng Qiao | Meng Zhao | Xiang Li | Di Yin | Jinlong Shu
Findings of the Association for Computational Linguistics: ACL 2025
While Chain of Thought (CoT) prompting approaches have significantly consolidated the reasoning capabilities of large language models (LLMs), they still either require extensive human effort or leave room for performance improvement. Existing endeavors have focused on bridging these gaps; however, these approaches either hinge on external data and cannot completely eliminate manual effort, or they fall short in effectively directing LLMs to generate high-quality exemplary prompts. To address these pitfalls, we propose a novel prompting approach for automatic reasoning named LBS3, inspired by curriculum learning, which better reflects human learning habits. Specifically, LBS3 initially steers LLMs to recall easy-to-hard proxy queries that are pertinent to the target query. Following this, it invokes a progressive strategy that utilizes exemplary prompts derived from easy-proxy queries to direct LLMs in solving hard-proxy queries, ensuring the high quality of the proxy solutions. Finally, our extensive experiments in various reasoning-intensive tasks with varying open- and closed-source LLMs show that LBS3 achieves strongly competitive performance compared to the SOTA baselines.
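The easy-to-hard flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `curriculum_prompt`, the `llm` callable, the prompt wording, and the `n_easy`/`n_hard` parameters are all hypothetical, and the final step of answering the target query with every proxy solution as an exemplar is an assumption about how the proxy solutions are used.

```python
from typing import Callable, List, Tuple


def curriculum_prompt(target: str, llm: Callable[[str], str],
                      n_easy: int = 2, n_hard: int = 2) -> str:
    """Answer `target` using self-generated easy-to-hard exemplars.

    `llm` is any function mapping a prompt string to a completion string
    (hypothetical; plug in a real model client here).
    """

    def recall_queries(level: str, n: int) -> List[str]:
        # Ask the model to recall proxy queries related to the target.
        reply = llm(
            f"Recall {n} {level} problems related to the problem below, "
            f"one per line.\nProblem: {target}"
        )
        lines = [line.strip() for line in reply.splitlines() if line.strip()]
        return lines[:n]

    def solve(query: str, exemplars: List[Tuple[str, str]]) -> str:
        # Build a few-shot prompt from previously solved proxy queries.
        shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in exemplars)
        return llm(f"{shots}Q: {query}\nA: Let's think step by step.")

    # Stage 1: easy proxy queries are solved without exemplars.
    easy = [(q, solve(q, [])) for q in recall_queries("easy", n_easy)]
    # Stage 2: hard proxy queries are solved with the easy solutions as exemplars.
    hard = [(q, solve(q, easy)) for q in recall_queries("hard", n_hard)]
    # Stage 3 (assumption): the target is solved with all proxy solutions.
    return solve(target, easy + hard)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, so the same flow can be tried with open- or closed-source models.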
2024
Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence
Junru Lu | Jiazheng Li | Siyu An | Meng Zhao | Yulan He | Di Yin | Xing Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: “verbosity”, a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between sequence-level Kullback–Leibler (KL) divergences between chosen and rejected sequences, used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate the presence of biased rewards. We then introduce an effective downsampling approach, named SamPO, to eliminate potential length reliance. Our experimental evaluations, conducted across three LLMs of varying scales and a diverse array of conditional and open-ended benchmarks, highlight the efficacy of SamPO in mitigating verbosity, achieving improvements of 5% to 12% over DPO through debiased rewards. Our code can be accessed at: https://github.com/LuJunru/SamPO/.
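The length-reliance argument can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the code from the linked repository: the function name `downsampled_dpo_loss` is hypothetical, the inputs are assumed to be per-token log-probability ratios between the policy and the reference model, and sampling both sequences down to the shorter length is one plausible reading of the downsampling step.

```python
import math
import random
from typing import List, Optional


def downsampled_dpo_loss(chosen_logratios: List[float],
                         rejected_logratios: List[float],
                         beta: float = 0.1,
                         rng: Optional[random.Random] = None) -> float:
    """DPO-style loss computed on an equal number of tokens per response.

    Each list holds per-token log(pi_policy / pi_reference) values
    (assumed input format for this sketch).
    """
    rng = rng or random.Random(0)
    # Assumption: both responses are down-sampled to the shorter length,
    # so neither implicit reward sums over more tokens than the other.
    k = min(len(chosen_logratios), len(rejected_logratios))
    chosen_sum = sum(rng.sample(chosen_logratios, k))
    rejected_sum = sum(rng.sample(rejected_logratios, k))
    margin = beta * (chosen_sum - rejected_sum)
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

For example, with a chosen response of ten tokens and a rejected response of two tokens, all with a per-token log ratio of 0.2, the plain sequence-level sums would differ by 1.6 purely because of length, while the down-sampled sums are equal and the margin is zero.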