When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

Jane Pan; Ryan Shar; Jacob Pfau; Ameet Talwalkar; He He; Valerie Chen

Abstract

Programming with a coding assistant is a fundamentally interactive process, yet existing static benchmarks fail to capture key features of model-user collaboration. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting, in which we obfuscate the input of static coding benchmarks so that the code model must interact with a simulated user. Across 10 models and 3 datasets, the relative rankings of models often permute greatly between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that similarly effective feedback types differ in terms of how models respond to higher- vs. lower-quality feedback. Moreover, feedback type impacts the degree to which the models make aesthetic or behavioral edits to their output. Our work aims to “re-evaluate” model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.
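The interactive setting described in the abstract can be pictured as a loop: the model sees an obfuscated prompt, proposes code, and a simulated user responds with feedback until the tests pass or a turn budget runs out. The following is a minimal sketch under stated assumptions, not the authors' pipeline. The names `Task`, `passes`, `interactive_eval`, `toy_model`, and `toy_user` are all hypothetical, and the model and user are hard-coded stand-ins rather than real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    # All fields are illustrative; the paper's actual data format is not shown here.
    full_prompt: str               # original benchmark problem statement
    obfuscated_prompt: str         # underspecified version shown to the model
    tests: List[Tuple[int, int]]   # (input, expected output) pairs


def passes(code: str, tests: List[Tuple[int, int]]) -> bool:
    """Run candidate code and check that its `solve` function meets every test."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # acceptable for a toy; real harnesses sandbox this
        return all(namespace["solve"](x) == y for x, y in tests)
    except Exception:
        return False


def interactive_eval(
    task: Task,
    model: Callable[[List[str]], str],
    user: Callable[[Task, str], str],
    max_turns: int = 3,
) -> int:
    """Return the 1-based turn at which the tests first pass, or 0 if they never do."""
    history = [task.obfuscated_prompt]
    for turn in range(1, max_turns + 1):
        code = model(history)
        if passes(code, task.tests):
            return turn
        # The simulated user knows the full task and comments on the attempt.
        history.append(code)
        history.append(user(task, code))
    return 0


def toy_model(history: List[str]) -> str:
    """Hypothetical stand-in for a code LLM: guesses identity until told to double."""
    if any("double" in message for message in history):
        return "def solve(x):\n    return 2 * x"
    return "def solve(x):\n    return x"


def toy_user(task: Task, code: str) -> str:
    """Hypothetical stand-in for the simulated user: always gives the same hint."""
    return "The function should double its input."


task = Task(
    full_prompt="Write solve(x) that returns x doubled.",
    obfuscated_prompt="Write solve(x).",
    tests=[(1, 2), (3, 6)],
)

static_result = interactive_eval(task, toy_model, toy_user, max_turns=1)
interactive_result = interactive_eval(task, toy_model, toy_user, max_turns=3)
```

With a single turn the toy model has only the obfuscated prompt and fails; with feedback it succeeds on the second turn. This is the kind of gap between static and interactive scores that the abstract refers to, shown here on invented data only.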

Anthology ID: 2025.findings-acl.1267
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24672–24700
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1267/
Cite (ACL): Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, and Valerie Chen. 2025. When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24672–24700, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback (Pan et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1267.pdf
