Zaiyuan Di


2026

While recent studies have increasingly emphasized the role of reflection in code repair tasks, existing benchmarks still target the repair generation capability of LLMs, lacking fine-grained evaluation of reflection generation capability. To this end, we propose Code Reffix, a benchmark featuring an automated pipeline with oracle reflections and a dual-task protocol to decouple the evaluation of reflection from repair. Through extensive experiments on 14 LLMs and fine-tuning analysis, we aim to pinpoint performance bottlenecks of code repair, quantify reflection quality, and verify the value of reflection optimization. Evaluations reveal that underperforming reflection capabilities of small-scale LLMs remain a major bottleneck for code repair. By quantifying this gap, Code Reffix provides a critical foundation for optimizing LLMs to achieve superior repair performance.