Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory

Aron Gohr; Marie-Amelie Lawn; Kevin Gao; Inigo Serjeant; Stephen Heslip

Abstract

Large language models can generate feedback on free-form student writing, but it is unclear whether such feedback is correct and pedagogically useful. We evaluate LLM-generated feedback on 65 undergraduate proof-writing exercises using Hattie and Timperley’s feedback framework and a grade agreement metric. We compare two models (GPT-4.1 and GPT-5) under two workflow configurations, each graded by two independent LLM evaluators. A mark-scheme-augmented workflow improves grade correlation with human experts for both models, and its precomputed mark schemes allow instructors to audit the system before deployment. GPT-5 produces higher-quality feedback across all dimensions. The metrics we collect give some evidence that feedback quality is high in the setting studied, and several sanity checks on our experiments support this finding. However, meaningful self-regulation support has yet to be achieved, and controlled tests with students remain to be done. These results show that feedback theory provides a useful lens for evaluating automated mathematical feedback.
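As an illustration of what "grade correlation with human experts" can mean in practice, the following is a minimal sketch that computes a Pearson correlation between human and LLM grades. The abstract does not state which agreement statistic the authors use, so the choice of Pearson is an assumption, and the function name and the grade lists are hypothetical placeholders, not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of grades.

    Assumption: Pearson is used here only as an example of a grade
    agreement statistic; the paper may use a different one.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant grades")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder grades (not taken from the paper).
human_grades = [2, 5, 7, 9, 10]
llm_grades = [3, 5, 6, 9, 10]

print(f"correlation with human grades: {pearson(human_grades, llm_grades):.3f}")
```

A rank-based statistic such as Spearman's coefficient would be a reasonable alternative when grades are ordinal or the marking scales differ between graders.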

Anthology ID: 2026.bea-1.55
Volume: Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Ekaterina Kochmar, Bashar Alhafni, Stefano Bannò, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anais Tack, Victoria Yaneva, Zheng Yuan
Venues: BEA | WS
Publisher: Association for Computational Linguistics
Pages: 813–818
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.55/
Cite (ACL): Aron Gohr, Marie-Amelie Lawn, Kevin Gao, Inigo Serjeant, and Stephen Heslip. 2026. Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), pages 813–818, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory (Gohr et al., BEA 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.55.pdf
