When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback
Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen
Abstract
Programming with a coding assistant is a fundamentally interactive process, yet existing static benchmarks fail to capture key features of model-user collaboration. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting, in which we obfuscate the input of static coding benchmarks so that the code model must interact with a simulated user. Across 10 models and 3 datasets, the relative rankings of models often permute greatly between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that similarly effective feedback types differ in terms of how models respond to higher- vs. lower-quality feedback. Moreover, feedback type impacts the degree to which the models make aesthetic or behavioral edits to their output. Our work aims to “re-evaluate” model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.- Anthology ID:
- 2025.findings-acl.1267
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 24672–24700
- Language:
- URL:
- https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1267/
- DOI:
- Cite (ACL):
- Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, and Valerie Chen. 2025. When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback. In Findings of the Association for Computational Linguistics: ACL 2025, pages 24672–24700, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback (Pan et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1267.pdf