Fabian Zehner
2026
Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German
Sebastian Gombert | Zhifan Sun | Fabian Zehner | Jannik Lossjew | Tobias Wyrwich | Berrit Czinczel | David Bednorz | Sascha Bernholt | Knut Neumann | Ute Harms | Aiso Heinze | Hendrik Drachsler
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
We present the BEA 2026 shared task on rubric-based short answer scoring for German. Rubric-based short answer scoring is a case of automatic short answer scoring (ASAS) that requires models to apply textual scoring rubrics to student answers as a basis for assigning scores. For the shared task, we introduced a novel German-language dataset from multiple STEM domains to provide a comprehensive benchmark for this problem. The dataset was designed to evaluate both performance and generalization (the latter by distinguishing between seen and unseen questions), as well as coarse- and fine-grained scoring (2-way vs. 3-way). The systems submitted to the shared task cover a wide range of approaches, including fine-tuned large language models, prompt-based methods, human-AI collaboration strategies, and combinations of these. The results show that structured, task-adapted LLM systems achieved the strongest performance across all tracks. The winning system, IWM-DKM, combined LoRA fine-tuning of Qwen models with rubric-aware input structuring, including checklist-style reasoning, rubric reframing as decision trees, background knowledge injection, and ensemble voting. Other systems similarly relied on fine-tuned LLMs, retrieval-augmented prompting, encoder–LLM ensembles, or weighted aggregation strategies. Overall, the shared task results show that rubric-based scoring benefits most from systems that explicitly operationalize rubric semantics, while generalization to unseen questions remains a central challenge.
Rubrics as Semantic Subspaces: A Unified Approach to Rubric-based Constructed Response Scoring across Short Answers and Essays
Sebastian Gombert | Sonja Hahn | Nico Andersen | Leon Camus | Zhifan Sun | Ngoc Nhu Hao Nguyen | Fabian Zehner | Longwei Cong | Alexander Mehler | Hendrik Drachsler
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Rubrics are the primary reference for manual scoring of constructed responses, and there is growing interest in their use in automated scoring methodologies. In this work, we propose Aspect-Grounded Rubric–Answer Alignment (AGRAA), a rubric-based end-to-end scoring framework that models rubric descriptors as latent aspect spaces. Concretely, rubric descriptors are represented as low-dimensional subspaces derived from contextualised transformer embeddings, and student responses are scored according to how strongly their representations align with these rubric-induced spaces relative to the residual space outside them. This formulation provides a geometrically grounded interpretation of rubric-based scoring while enabling end-to-end training with standard transformer encoders. We introduce three distinct architectural variants and evaluate them on multiple short-answer and essay scoring datasets. Across these tasks, AGRAA achieves predictive performance highly competitive with strong neural and feature-based baselines. In addition, the framework yields interpretable intermediate representations that expose which rubric-defined aspects contribute to scoring decisions, enabling decision-aligned explanations grounded in rubric descriptors.
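The geometric idea in the abstract above, scoring a response by how much of its representation lies inside a rubric-induced subspace versus the residual outside it, can be illustrated with a toy sketch. This is not the AGRAA architecture: the vectors are hand-written stand-ins for transformer embeddings, the subspace is built by plain Gram-Schmidt, and the function names `orthonormal_basis` and `alignment` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: turn rubric descriptor vectors into an orthonormal basis.

    Assumption: the descriptor vectors stand in for embeddings of rubric text.
    Vectors that are (numerically) linearly dependent are dropped.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def alignment(response, basis):
    """Fraction of the response's squared norm that lies inside the subspace.

    1.0 means the response lies entirely in the rubric subspace,
    0.0 means it lies entirely in the residual space.
    """
    total = dot(response, response)
    if total == 0:
        return 0.0
    inside = sum(dot(response, b) ** 2 for b in basis)
    return inside / total


# Hypothetical rubric descriptors spanning the x-y plane of a 3-d space.
rubric_basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

In this toy setup, `alignment([1.0, 0.0, 1.0], rubric_basis)` splits the response evenly between the rubric subspace and the residual. A trained system would learn the subspaces and the mapping from alignment to score; the sketch only shows the projection step.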
2025
TBA at BEA 2025 Shared Task: Transfer-Learning from DARE-TIES Merged Models for the Pedagogical Ability Assessment of LLM-Powered Math Tutors
Sebastian Gombert | Fabian Zehner | Hendrik Drachsler
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
This paper presents our contribution to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-Powered Tutors. The objective of this shared task was to assess the quality of conversational feedback provided by LLM-based math tutors to students with respect to four facets: whether the tutors 1) identified mistakes, 2) identified the mistake’s location, 3) provided guidance, and 4) provided actionable feedback. To leverage information across all four labels, we approached the problem with FLAN-T5 models, which we fit for this task using a multi-step pipeline involving regular fine-tuning as well as model merging using the DARE-TIES algorithm. We demonstrate that our pipeline improves overall model performance compared to regular fine-tuning alone. With test-set results ranging from 52.1 to 68.6 in F1 score and from 62.2% to 87.4% in accuracy, our best models placed 11th of 44 teams in Track 1, 8th of 31 teams in Track 2, 11th of 35 teams in Track 3, and 9th of 30 teams in Track 4. Notably, the classifiers’ recall was relatively poor for underrepresented classes, indicating even greater potential for the employed methodology.
Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments
Fabian Zehner | Hyo Jeong Shin | Emily Kerzabi | Andrea Horbach | Sebastian Gombert | Frank Goldhammer | Torsten Zesch | Nico Andersen
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
We present the framework Omethi, which is aimed at scoring short text responses in a semi-automatic fashion and is particularly suited to international large-scale assessments. We evaluate its effectiveness for the massively multilingual PISA tests. Responses are passed through a conditional flow of hierarchically combined scoring components to assign a score. Once a score is assigned, hierarchically lower components are discarded. Models implemented in this study ranged from lexical matching of normalized texts (with excellent accuracy but weak generalizability) to fine-tuned large language models (with lower accuracy but high generalizability). If not scored by any automatic component, responses are passed on to manual scoring. The paper is the first to provide an evaluation of automatic scoring on multilingual PISA data in eleven languages (including Arabic, Finnish, Hebrew, and Kazakh) from three domains (n = 3.8 million responses). On average, results show a manual effort reduction of 71 percent alongside an agreement of 𝜅 = .957 when including manual scoring, and 𝜅 = .804 for only the automatically scored responses. The evaluation underscores the framework’s effective adaptivity and operational feasibility, with the shares of used components varying substantially across domains and languages while maintaining homogeneously high accuracy.
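The conditional flow described in the abstract above (try the components in hierarchical order, stop at the first one that assigns a score, fall back to manual scoring otherwise) can be sketched as follows. This is a minimal illustration under assumptions, not Omethi's implementation: the two components, their lookup table and keyword rule, and all names (`lexical_match`, `model_scorer`, `score_cascade`) are hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence, Tuple

# A component returns a score, or None if it declines to score the response.
Scorer = Callable[[str], Optional[int]]


def lexical_match(response: str) -> Optional[int]:
    """Hypothetical top-level component: exact lookup of the normalized text."""
    known = {"photosynthesis": 1, "i do not know": 0}
    normalized = " ".join(response.lower().split())
    return known.get(normalized)


def model_scorer(response: str) -> Optional[int]:
    """Hypothetical stand-in for a fine-tuned model that abstains when unsure."""
    text = response.lower()
    if "sunlight" in text and "plant" in text:
        return 1
    return None  # not confident enough: defer to the next stage


def score_cascade(
    response: str, components: Sequence[Tuple[str, Scorer]]
) -> Tuple[Optional[int], str]:
    """Return (score, source); lower components are skipped once one scores."""
    for name, component in components:
        score = component(response)
        if score is not None:
            return score, name
    return None, "manual"  # no automatic component scored the response


COMPONENTS = [("lexical", lexical_match), ("model", model_scorer)]
```

For example, `score_cascade("Plants use sunlight", COMPONENTS)` falls through the lexical lookup and is scored by the model stand-in, while a response that neither component handles comes back with source `"manual"`. Ordering the components from most precise to most general mirrors the accuracy and generalizability trade-off the abstract describes.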