Shivanshu Verma


2025

Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
Amir Saeidi | Shivanshu Verma | Md Nayem Uddin | Chitta Baral
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

This study evaluates Direct Preference Optimization (DPO) and its variants for aligning Large Language Models (LLMs) with human preferences, testing three configurations: (1) with Supervised Fine-Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction-tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks, covering dialogue, reasoning, mathematical problem-solving, question answering, truthfulness, MT-Bench, Big Bench, and the Open LLM Leaderboard. We find that: (1) alignment methods often achieve near-optimal performance even with smaller subsets of training data; (2) although they offer limited improvements on complex reasoning tasks, they enhance mathematical problem-solving; and (3) using an instruction-tuned model improves truthfulness. These insights highlight the conditions under which alignment methods excel, as well as their limitations.
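
For context, the methods evaluated here build on the Direct Preference Optimization objective of Rafailov et al. (2023). A minimal statement of that loss, written in conventional notation rather than quoted from this paper:

  \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} is a frozen reference model (the SFT or instruction-tuned model in the three configurations above), \sigma is the logistic function, \beta is a scaling hyperparameter, and (x, y_w, y_l) is a prompt paired with a preferred and a dispreferred response.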