Tongliang Liu

2026

Select Before Use: On the Importance of Reference Model Selection in Preference Alignment
Muyang Li | Runze Wu | Xiangyu Zhao | Bo Han | Daoyi Dong | Tongliang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The post-training stage of Large Language Models (LLMs) typically involves Supervised Fine-Tuning (SFT) followed by preference alignment to ensure that the LLM generates safe, helpful, and instruction-aligned content. The SFT model serves as both the initialization and the reference model for subsequent preference alignment. However, an essential yet often neglected question is how to select the best SFT checkpoint for this role. We show that checkpoint selection substantially affects final performance, and that the common practice of choosing the checkpoint with the minimum validation loss often fails, due to a fundamental conflict between SFT’s focus on imitation and alignment’s goal of response discriminability. To this end, we propose RewardRank, a simple, effective, training-free metric for estimating the initial implicit alignment between the reference model and the preference objective. Empirical evidence suggests that using our selected model as the reference yields up to a 67.6% relative increase in length-controlled win rate on the popular Zephyr recipe compared to baselines.
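The abstract does not define RewardRank, so the following is not that metric. It is only a minimal sketch of the general idea of "response discriminability": scoring each SFT checkpoint, without any training, by how often it assigns a higher log-likelihood to the preferred response than to the rejected one. The names `LogProbFn`, `pairwise_preference_accuracy`, and `pick_checkpoint` are hypothetical, and the log-probability function is assumed to be supplied by the caller.

```python
from typing import Callable, Mapping, Sequence, Tuple

# Hypothetical interface: returns log p(response | prompt) under one checkpoint.
LogProbFn = Callable[[str, str], float]
# One preference example: (prompt, chosen response, rejected response).
PreferencePair = Tuple[str, str, str]


def pairwise_preference_accuracy(
    pairs: Sequence[PreferencePair], logprob_fn: LogProbFn
) -> float:
    """Fraction of pairs where the chosen response is more likely than the rejected one."""
    if not pairs:
        raise ValueError("at least one preference pair is required")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if logprob_fn(prompt, chosen) > logprob_fn(prompt, rejected)
    )
    return correct / len(pairs)


def pick_checkpoint(
    checkpoints: Mapping[str, LogProbFn], pairs: Sequence[PreferencePair]
) -> str:
    """Return the name of the checkpoint with the highest pairwise accuracy."""
    return max(
        checkpoints,
        key=lambda name: pairwise_preference_accuracy(pairs, checkpoints[name]),
    )
```

Under this assumed proxy, a checkpoint is compared on held-out preference pairs rather than on validation loss, which is the contrast the abstract draws.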

2024

Contemporary practices in instruction tuning often hinge on scaling up data without a clear strategy for ensuring data quality, inadvertently introducing noise that may compromise model performance. To address this challenge, we introduce Nuggets, a novel and efficient methodology that leverages one-shot learning to discern and select high-quality instruction data from extensive datasets. Nuggets assesses the potential of individual instruction examples to act as effective one-shot learning instances, thereby identifying those that can significantly improve performance across diverse tasks. Nuggets utilizes a scoring system based on the impact of candidate examples on the perplexity of a diverse anchor set, facilitating the selection of the most advantageous data for instruction tuning. Through rigorous evaluations on two benchmarks, namely MT-Bench and Alpaca-Eval, our study illustrates that instruction tuning with the top 1% of examples curated by Nuggets substantially outperforms conventional methods employing the entire dataset.
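The scoring idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `loss_fn` is a hypothetical caller-supplied function returning the model's average per-token negative log-likelihood of a target given a prompt (perplexity is the exponential of this value, so comparing losses is equivalent to comparing perplexities), and the prompt format, function names, and the "fraction of anchors improved" score are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: average per-token negative log-likelihood of
# `target` given `prompt`. Lower means lower perplexity.
LossFn = Callable[[str, str], float]
# One instruction example: (instruction, response).
Example = Tuple[str, str]


def one_shot_score(
    candidate: Example, anchors: Sequence[Example], loss_fn: LossFn
) -> float:
    """Fraction of anchor tasks whose loss drops when `candidate` is used as a one-shot demo."""
    if not anchors:
        raise ValueError("the anchor set must not be empty")
    cand_instruction, cand_response = candidate
    demo = f"{cand_instruction}\n{cand_response}\n\n"  # assumed prompt format
    improved = 0
    for instruction, response in anchors:
        zero_shot = loss_fn(instruction, response)
        one_shot = loss_fn(demo + instruction, response)
        if one_shot < zero_shot:
            improved += 1
    return improved / len(anchors)


def select_top(
    candidates: Sequence[Example],
    anchors: Sequence[Example],
    loss_fn: LossFn,
    fraction: float = 0.01,
) -> List[Example]:
    """Keep the highest-scoring `fraction` of candidates (at least one)."""
    ranked = sorted(
        candidates,
        key=lambda c: one_shot_score(c, anchors, loss_fn),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

The default `fraction=0.01` mirrors the "top 1%" setting mentioned in the abstract; in practice each `loss_fn` call would be a forward pass of the language model, so the cost grows with the number of candidates times the number of anchors.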

Venues

ACL (2)