Geon Park


2025

This paper presents automatic evaluation systems for assessing the pedagogical capabilities of LLM-based AI tutors. Drawing from a shared task, our systems specifically target four key dimensions of tutor responses: Mistake Identification, Mistake Location, Providing Guidance, and Actionability. These dimensions capture the educational quality of responses from multiple perspectives, including the ability to detect student mistakes, accurately identify error locations, provide effective instructional guidance, and offer actionable feedback. We propose GPT-4.1-based automatic evaluation systems, leveraging their strong capabilities in comprehending diverse linguistic expressions and complex conversational contexts to address the detailed evaluation criteria across these dimensions. Our systems were quantitatively evaluated based on the official criteria of each track. In the Mistake Location track, our evaluation systems achieved an Exact macro F1 score of 58.80% (ranked in the top 3), and in the Providing Guidance track, they achieved 56.06% (ranked in the top 5). While the systems showed mid-range performance in the remaining tracks, the overall results demonstrate that our proposed automatic evaluation systems can effectively assess the quality of tutor responses, highlighting their potential for evaluating AI tutor effectiveness.