Aron Gohr


2026

Large language models can generate feedback on free-form student writing, but it is unclear whether such feedback is correct and pedagogically useful. We evaluate LLM-generated feedback on 65 undergraduate proof-writing exercises using Hattie and Timperley’s feedback framework and a grade agreement metric. We compare two models (GPT-4.1, GPT-5) under two workflow configurations, with the resulting feedback graded by two independent LLM evaluators. A mark-scheme-augmented workflow improves grade correlation with human experts for both models, and its precomputed mark schemes allow instructors to audit the system before deployment. GPT-5 produces higher-quality feedback across all dimensions. The metrics we collect give some evidence that feedback quality is high in the setting studied, and several sanity checks on our experiments support this finding. However, providing meaningful self-regulation support and running controlled tests with students remain open. These results show that feedback theory provides a useful lens for evaluating automated mathematical feedback.