Qianou Ma
2026
RECAP: An End-to-End Platform for Capturing, Replaying, and Analyzing AI-Assisted Programming Interactions
Keyu He | Qianou Ma | Valerie Chen | Wayne Chi | Tongshuang Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Understanding how developers interact with AI coding assistants requires more than chat logs or git histories in isolation; it requires reconstructing the full context: which prompt led to which edit, what the developer tried and discarded, and how their strategy evolved over time. We present RECAP (Replay and Examine Captured AI Programming), an open-source platform that (1) passively records AI chat sessions and fine-grained code edits inside VS Code without disrupting the developer’s workflow, (2) merges them into a unified timeline for interactive session replay, and (3) exposes an extensible analysis layer, with example modules for behavioral classification and AI reliance measurement. Deployed in a university software engineering course, RECAP captured 2,034 prompts and 8,239 code edits from 41 students across a multi-week project. We demonstrate how the platform’s linked data and replay capabilities enable analyses of developer-AI interaction patterns that no single data source could support.
2025
SPHERE: An Evaluation Card for Human-AI Systems
Dora Zhao | Qianou Ma | Xinran Zhao | Chenglei Si | Chenyang Yang | Ryan Louie | Ehud Reiter | Diyi Yang | Tongshuang Wu
Findings of the Association for Computational Linguistics: ACL 2025
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion of design options for human-AI system evaluation, we present SPHERE, an evaluation card that encompasses five key dimensions: 1) What is being evaluated? 2) How is the evaluation conducted? 3) Who is participating in the evaluation? 4) When is the evaluation conducted? 5) How is the evaluation validated? We review 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement, and provide three recommendations for improving the validity and rigor of evaluation practices.