The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose 𝛾-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, 𝛾-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, 𝛾-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, 𝛾-PO achieves an average 4.4% improvement over other baselines, setting a new state of the art. Additionally, 𝛾-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
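To make the idea of a pairwise target margin concrete, the following is a minimal sketch and not the authors' implementation. It assumes a margin-based preference loss of the form -log σ((r_w − r_l) − γ_i) and a hypothetical calibration rule (`dynamic_margins`) that redistributes a base margin across a batch with a softmax over the observed reward margins. The function names, the softmax rule, and the `temperature` parameter are all illustrative assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_loss(reward_chosen, reward_rejected, target_margin):
    """Margin-based preference loss: -log sigmoid((r_w - r_l) - gamma)."""
    return -log_sigmoid((reward_chosen - reward_rejected) - target_margin)


def dynamic_margins(reward_margins, base_margin, temperature=1.0):
    """Hypothetical per-pair calibration (an assumption, not the paper's rule).

    Each pair receives a share of the batch's total margin budget
    proportional to the softmax of its reward margin, so the batch mean
    of the returned margins equals base_margin.
    """
    peak = max(reward_margins)
    weights = [math.exp((m - peak) / temperature) for m in reward_margins]
    total = sum(weights)
    n = len(reward_margins)
    return [base_margin * n * w / total for w in weights]


# Two pairs: one ambiguous (small reward margin), one confident (large margin).
observed = [0.1, 2.0]
gammas = dynamic_margins(observed, base_margin=1.0)
losses = [margin_loss(m, 0.0, g) for m, g in zip(observed, gammas)]
```

Under this assumed rule the confident pair receives the larger target margin. Because the gradient magnitude of the loss grows with the target margin, such a pair contributes more to the update than it would under a single fixed margin, while the ambiguous pair contributes less.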
Auctions are a vital economic mechanism used to determine the market value of goods or services through competitive bidding within a specific framework. However, much of the current research focuses primarily on the bidding algorithms used within auction mechanisms and often neglects the potential benefits of incorporating individual users' unique preferences into the valuation process. Our theoretical and empirical analysis demonstrates that valuation errors can significantly impact the overall utility. To bridge this gap, we propose a personalized valuation framework, namely Large Language Models-powered Personalized Valuation (LaMP-Val), which uses Large Language Models to incorporate personalized semantic preferences into users' valuation process. LaMP-Val integrates three components: data, learning, and evaluation. The data component tackles the challenge of building a novel dataset specifically for LLM fine-tuning in personalized valuation modeling. The learning component introduces a diversity template to enhance LLMs' capacity for modeling fine-grained personal valuation patterns. The evaluation component establishes a closed-loop system in which LLM-generated valuations interact with bidding strategies and auctions, and it proposes two novel metrics to quantify valuation precision and bidding intention accuracy in personalized scenarios. Extensive experiments show that LaMP-Val more accurately captures personalized values and achieves greater profits than baseline approaches.
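The claim that valuation errors affect utility can be illustrated with a small worked example. The sketch below assumes a sealed-bid second-price auction in which the bidder bids its estimated valuation; the abstract does not specify the auction format, so this setting and the function name `second_price_utility` are assumptions chosen for illustration.

```python
def second_price_utility(true_value, bid, highest_competing_bid):
    """Utility of one bidder in a sealed-bid second-price auction.

    The bidder wins only if its bid exceeds the highest competing bid,
    and then pays that competing bid. Utility is measured against the
    bidder's true value, not against its (possibly wrong) estimate.
    """
    if bid > highest_competing_bid:
        return true_value - highest_competing_bid
    return 0.0


true_value = 10.0

# Accurate valuation: the bidder wins at price 8 and gains 2.
accurate = second_price_utility(true_value, bid=10.0, highest_competing_bid=8.0)

# Underestimated valuation (7): the bidder loses a profitable auction.
underestimate = second_price_utility(true_value, bid=7.0, highest_competing_bid=8.0)

# Overestimated valuation (13): the bidder wins at price 12 and loses 2.
overestimate = second_price_utility(true_value, bid=13.0, highest_competing_bid=12.0)
```

In this toy setting, both directions of error hurt: an underestimate forfeits a gain of 2, and an overestimate turns a correct abstention (utility 0) into a loss of 2.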
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation: it alleviates compounding errors and offers a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks is challenging because the partition function can no longer be cancelled. Overcoming this obstacle requires making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. To this end, we replace the policy constraint with a state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks, together with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
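The role of length normalization in the Bradley-Terry model can be sketched as follows. This is an illustration only and not the DMPO loss itself: it assumes each trajectory is scored by the plain mean of its per-turn rewards, whereas the paper's exact normalization is not given in the abstract. The function names are hypothetical.

```python
import math


def trajectory_score(turn_rewards, length_normalize=True):
    """Score a trajectory by its summed (or mean) per-turn reward."""
    total = sum(turn_rewards)
    if length_normalize:
        # Assumed normalization: divide by the number of turns.
        return total / len(turn_rewards)
    return total


def bt_preference_prob(win_rewards, lose_rewards, length_normalize=True):
    """Bradley-Terry probability that the first trajectory is preferred."""
    diff = (trajectory_score(win_rewards, length_normalize)
            - trajectory_score(lose_rewards, length_normalize))
    return 1.0 / (1.0 + math.exp(-diff))


# Same per-turn quality, different lengths (4 turns versus 2 turns).
long_traj = [1.0, 1.0, 1.0, 1.0]
short_traj = [1.0, 1.0]

p_raw = bt_preference_prob(long_traj, short_traj, length_normalize=False)
p_norm = bt_preference_prob(long_traj, short_traj, length_normalize=True)
```

Without normalization the longer trajectory is favored purely because it accumulates more reward terms; with the assumed normalization the two trajectories are treated as equally good, which is the length disparity the abstract refers to.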