Teaching Your Models to Understand Code via Focal Preference Alignment
Jie Wu | Haoling Li | Xin Zhang | Xiao Liu | Yangyu Huang | Jianwen Luo | Yizhen Zhang | Zuchao Li | Ruihang Chu | Yujiu Yang | Scarlett Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated by test case success rate, with the candidate achieving a higher pass rate labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity needed to capture meaningful error-correction relationships, and the model consequently fails to learn informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To support this, we introduce the CodeFlow dataset, in which samples are iteratively refined until they pass tests, with the modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks such as BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, models, and datasets are available at: https://github.com/JieWu02/Target-DPO.
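The abstract does not spell out the objective, but the idea of aligning only the tokens in localized error regions can be illustrated with a token-masked variant of the standard DPO loss. The sketch below is an assumption for illustration only (the function name `target_dpo_loss`, the mask construction, and the argument layout are hypothetical, not the paper's implementation); it assumes the error/fix masks come from diffing a failing sample against its repaired version.

```python
import torch
import torch.nn.functional as F

def target_dpo_loss(
    policy_chosen_logps,    # (batch, seq) per-token log-probs of the repaired code under the policy
    policy_rejected_logps,  # (batch, seq) per-token log-probs of the failing code under the policy
    ref_chosen_logps,       # (batch, seq) per-token log-probs of the repaired code under the frozen reference
    ref_rejected_logps,     # (batch, seq) per-token log-probs of the failing code under the frozen reference
    chosen_mask,            # (batch, seq) 1.0 on tokens inside the corrected region, 0.0 elsewhere (hypothetical)
    rejected_mask,          # (batch, seq) 1.0 on tokens inside the localized error region, 0.0 elsewhere (hypothetical)
    beta: float = 0.1,
):
    """Token-masked DPO sketch: only tokens that differ between the failing
    and repaired solutions contribute to the preference margin, rather than
    the entire code block as in pass-rate-based preference pairs."""
    chosen_logratio = ((policy_chosen_logps - ref_chosen_logps) * chosen_mask).sum(-1)
    rejected_logratio = ((policy_rejected_logps - ref_rejected_logps) * rejected_mask).sum(-1)
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

In this reading, the mask is what makes the alignment "focal": tokens shared by both solutions cancel out of the objective, so the gradient concentrates on the edited span rather than on code that was already correct.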