Klaus Giebermann

2026

Multi-step Large Language Model for Fine-Grained Feedback in Stepwise Linear Equation Solutions
Imran Chamieh | Torsten Zesch | Klaus Giebermann
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

This paper addresses the problem of fine-grained error classification in stepwise algebraic problem solving, with the objective of enabling accurate and timely feedback in large-scale educational environments. Using authentic student response data, we compare a carefully engineered rule-based baseline with large language models (LLMs) in zero-shot and few-shot configurations, as well as multi-step LLM-based approaches. We further consider hybrid architectures that combine symbolic computation with LLM inferential processes, with particular emphasis on enhancing the robustness and faithfulness of intermediate representations and mitigating error propagation across successive stages of the computational pipeline. Our empirical results indicate that, although the baseline model delivers strong and reliable performance for narrowly defined error categories, structured multi-step approaches improve performance relative to single-step methods by achieving superior precision, F1 scores, and overall accuracy.

2024

LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches
Imran Chamieh | Torsten Zesch | Klaus Giebermann
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

In this work, we investigate the potential of Large Language Models (LLMs) for automated short answer scoring. We test zero-shot and few-shot settings and compare them with fine-tuned models and a supervised upper bound across three diverse datasets. Our results show that LLMs perform poorly in zero-shot and few-shot settings: they have difficulty with tasks that require complex reasoning or domain-specific knowledge, while showing promise on general knowledge tasks. The fine-tuned models come close to the supervised results but are still not feasible for application, highlighting potential overfitting issues. Overall, our study highlights the challenges and limitations of LLMs in short answer scoring and indicates that there currently seems to be no basis for applying LLMs for short answer scoring.

Co-authors

Imran Chamieh 2
Torsten Zesch 2

Venues

BEA 2
WS 1
