Gihyeon Choi


2025

K-NLPers at BEA 2025 Shared Task: Evaluating the Quality of AI Tutor Responses with GPT-4.1
Geon Park | Jiwoo Song | Gihyeon Choi | Juoh Sun | Harksoo Kim
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

This paper presents automatic evaluation systems for assessing the pedagogical capabilities of LLM-based AI tutors. Developed for the BEA 2025 Shared Task, our systems target four key dimensions of tutor responses: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. These dimensions capture the educational quality of responses from multiple perspectives, including the ability to detect student mistakes, accurately identify error locations, provide effective instructional guidance, and offer actionable feedback. We propose GPT-4.1-based automatic evaluation systems, leveraging the model's strong capabilities in comprehending diverse linguistic expressions and complex conversational contexts to address the detailed evaluation criteria across these dimensions. Our systems were quantitatively evaluated according to the official criteria of each track. In the Mistake Location track, our systems achieved an Exact macro F1 score of 58.80% (ranked in the top 3), and in the Providing Guidance track, they achieved 56.06% (ranked in the top 5). While the systems showed mid-range performance in the remaining tracks, the overall results demonstrate that our proposed automatic evaluation systems can effectively assess the quality of tutor responses, highlighting their potential for evaluating AI tutor effectiveness.