Ute Harms
2026
Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German
Sebastian Gombert | Zhifan Sun | Fabian Zehner | Jannik Lossjew | Tobias Wyrwich | Berrit Czinczel | David Bednorz | Sascha Bernholt | Knut Neumann | Ute Harms | Aiso Heinze | Hendrik Drachsler
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
We present the BEA 2026 shared task on rubric-based short answer scoring for German. Rubric-based short answer scoring is a variant of automatic short answer scoring (ASAS) in which models must apply textual scoring rubrics to student answers in order to assign scores. For the shared task, we introduced a novel German-language dataset covering multiple STEM domains to provide a comprehensive benchmark for this problem. The dataset was designed to evaluate both performance and generalization (the latter by distinguishing between seen and unseen questions), as well as coarse- and fine-grained scoring (2-way vs. 3-way). The systems submitted to the shared task cover a wide range of approaches, including fine-tuned large language models, prompt-based methods, human-AI collaboration strategies, and combinations of these. The results show that structured, task-adapted LLM systems achieved the strongest performance across all tracks. The winning system, IWM-DKM, combined LoRA fine-tuning of Qwen models with rubric-aware input structuring, including checklist-style reasoning, reframing of rubrics as decision trees, background knowledge injection, and ensemble voting. Other systems similarly relied on fine-tuned LLMs, retrieval-augmented prompting, encoder–LLM ensembles, or weighted aggregation strategies. Overall, the results show that rubric-based scoring benefits most from systems that explicitly operationalize rubric semantics, while generalization to unseen questions remains a central challenge.