This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
JulianKatz-Samuels
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
When aligning large language models (LLMs), their performance across various tasks (such as being helpful, harmless, and honest) is heavily influenced by the composition of the training data. However, it is difficult to determine what mixture of data should be used to produce a model with strong performance across all tasks. Existing approaches rely on large ablation studies, heuristics, or human intuition, though these can be prohibitively expensive and suboptimal. We study this problem in the context of preference optimization via DPO and propose a novel and theoretically justified algorithm, AutoMixAlign (AMA), that adaptively mixes datasets during LLM training to balance performance across multiple tasks. AMA first trains specialist models for each task to determine losses that corresponding to strong task performance. Next, AMA trains a generalist model using a novel minimax optimization that prioritizes tasks for which generalist model losses are furthest from specialist model losses. We introduce two algorithms to optimize this problem: (1) AMA-R adaptively reweights the objective to prioritize tasks, and (2) AMA-S adaptively adjusts how much data is sampled from each task to prioritize tasks. Both algorithms achieve a convergence rate of O(1/√T) in the convex case. AMA-R’s convergence result immediately follows from Sagawa et. al, 2019, and we provide a convergence proof for AMA-S using techniques from online learning such as EXP3 (Auer et. al, 2002). We evaluate AMA on several multitask alignment setups, and observe that AMA outperforms the standard alignment approach which simply optimizes the total loss across all tasks and also outperforms model-merging methods.
We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward models and online sampling. Despite these benefits, these methods rely on explicit assumptions about the Bradley-Terry (BT) model, which makes them prone to overfitting and results in suboptimal performance, particularly on reasoning-heavy tasks. To address these challenges, we propose a principled preference fine-tuning algorithm called InfoPO, which effectively and efficiently aligns large language models using preference data. InfoPO eliminates the reliance on the BT model and prevents the likelihood of the chosen response from decreasing. Extensive experiments confirm that InfoPO consistently outperforms established baselines on widely used open benchmarks, particularly in reasoning tasks.
The ability of large language models (LLMs) to execute complex instructions is essential for their real-world applications. However, several recent studies indicate that LLMs struggle with challenging instructions. In this paper, we propose Evolutionary Contrastive Distillation (ECD), a novel method for generating high-quality synthetic preference data designed to enhance the complex instruction-following capability of language models. ECD generates data that specifically illustrates the difference between a response that successfully follows a set of complex instructions and a response that is high-quality, but nevertheless makes some subtle mistakes. This is done by prompting LLMs to progressively evolve simple instructions to more complex instructions. When the complexity of an instruction is increased, the original successful response to the original instruction becomes a “hard negative” response for the new instruction, mostly meeting requirements of the new instruction, but barely missing one or two. By pairing a good response with such a hard negative response, and employing contrastive learning algorithms such as DPO, we improve language models’ ability to follow complex instructions. Empirically, we observe that our method yields a 7B model that exceeds the complex instruction-following performance of current SOTA 7B models and is competitive even with open-source 70B models.