Aron Gohr

2026

Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr | Marie-Amelie Lawn | Kevin Gao | Inigo Serjeant | Stephen Heslip
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Large language models can generate feedback on free-form student writing, but it is unclear whether such feedback is correct and pedagogically useful. We evaluate LLM-generated feedback on 65 undergraduate proof-writing exercises using Hattie and Timperley’s feedback framework and a grade agreement metric. We compare two models (GPT-4.1, GPT-5) under two workflow configurations, with outputs graded by two independent LLM evaluators. A mark-scheme-augmented workflow improves grade correlation with human experts for both models, and its precomputed mark schemes allow instructors to audit the system before deployment. GPT-5 produces higher-quality feedback across all dimensions. The metrics we collect give some evidence that feedback quality is high in the setting studied, and several sanity checks on our experiments support this finding. However, meaningful self-regulation support has yet to be provided, and controlled tests with students remain to be done. Our results show that feedback theory provides a useful lens for evaluating automated mathematical feedback.

Venues

BEA (1)
WS (1)