Fabian Zehner


2025

Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments
Fabian Zehner | Hyo Jeong Shin | Emily Kerzabi | Andrea Horbach | Sebastian Gombert | Frank Goldhammer | Torsten Zesch | Nico Andersen
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

We present the framework Omethi, which is aimed at scoring short text responses in a semi-automatic fashion and is particularly suited to international large-scale assessments. We evaluate its effectiveness on the massively multilingual PISA tests. Responses are passed through a conditional flow of hierarchically combined scoring components to assign a score; once a score is assigned, hierarchically lower components are discarded. The models implemented in this study range from lexical matching of normalized texts (excellent accuracy but weak generalizability) to fine-tuned large language models (lower accuracy but high generalizability). Responses not scored by any automatic component are passed on to manual scoring. The paper is the first to provide an evaluation of automatic scoring on multilingual PISA data in eleven languages (including Arabic, Finnish, Hebrew, and Kazakh) from three domains (_n_ = 3.8 million responses). On average, results show a manual-effort reduction of 71 percent alongside an agreement of _κ_ = .957 when manual scoring is included, and _κ_ = .804 for the automatically scored responses alone. The evaluation underscores the framework’s effective adaptivity and operational feasibility: the shares of components used vary substantially across domains and languages while accuracy remains homogeneously high.
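To illustrate the cascading idea described in the abstract, the following is a minimal Python sketch of a scoring cascade: each component either assigns a score or defers to the next one, and anything left unscored falls back to manual scoring. All component names, the answer key, and the function signatures are hypothetical and only stand in for the actual Omethi components.

```python
from typing import Callable, Optional

# A scorer takes a response and returns a score, or None to defer to the next component.
Scorer = Callable[[str], Optional[int]]

def lexical_match_scorer(response: str) -> Optional[int]:
    """Toy lexical component: scores only exact matches of normalized text."""
    answer_key = {"paris": 1, "the capital is paris": 1}  # illustrative key, not PISA data
    normalized = response.strip().lower()
    return answer_key.get(normalized)  # None means "not scored by this component"

def llm_scorer(response: str) -> Optional[int]:
    """Placeholder for a fine-tuned LLM component; abstains by returning None."""
    return None  # a real model would return a score or abstain when uncertain

def score_with_cascade(response: str, components: list[Scorer]) -> tuple[Optional[int], str]:
    """Pass the response through components in priority order.

    The first component that assigns a score wins; hierarchically lower
    components are skipped. Unscored responses go to manual scoring.
    """
    for component in components:
        score = component(response)
        if score is not None:
            return score, component.__name__
    return None, "manual_scoring"

if __name__ == "__main__":
    cascade = [lexical_match_scorer, llm_scorer]
    for text in ["Paris", "I think it is the city of lights"]:
        score, source = score_with_cascade(text, cascade)
        print(f"{text!r}: score={score} (by {source})")
```

In this toy setup, the high-precision lexical component handles easy responses cheaply, and only the remainder reaches the more general (and more expensive) model or human raters, mirroring the effort reduction reported in the abstract.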

TBA at BEA 2025 Shared Task: Transfer-Learning from DARE-TIES Merged Models for the Pedagogical Ability Assessment of LLM-Powered Math Tutors
Sebastian Gombert | Fabian Zehner | Hendrik Drachsler
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

This paper presents our contribution to the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-Powered Tutors. The objective of the shared task was to assess the quality of conversational feedback that LLM-based math tutors provide to students along four facets: whether the tutors 1) identified a mistake, 2) identified the mistake’s location, 3) provided guidance, and 4) provided actionable feedback. To leverage information across all four labels, we approached the problem with FLAN-T5 models, which we adapted to the task using a multi-step pipeline combining regular fine-tuning with model merging via the DARE-TIES algorithm. We demonstrate that this pipeline improves overall model performance compared to regular fine-tuning alone. With test-set results ranging from 52.1 to 68.6 in F1 score and from 62.2% to 87.4% in accuracy, our best models placed 11th of 44 teams in Track 1, 8th of 31 teams in Track 2, 11th of 35 teams in Track 3, and 9th of 30 teams in Track 4. Notably, the classifiers’ recall was comparatively poor for underrepresented classes, indicating even greater potential for the employed methodology.
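For readers unfamiliar with DARE-TIES, the sketch below shows how such a merge of fine-tuned checkpoints can be expressed over raw parameter tensors: DARE randomly drops a share of each task vector (the delta between a fine-tuned model and its base) and rescales the survivors, and TIES then elects a per-parameter sign and averages only the deltas that agree with it. This is a generic, simplified illustration of the merging technique, not the paper’s actual pipeline; the drop rate, weight, and toy tensors are assumptions for demonstration.

```python
import torch

def dare(delta: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """DARE step: randomly drop delta entries and rescale the survivors."""
    keep_mask = torch.rand_like(delta) >= drop_rate
    return delta * keep_mask / (1.0 - drop_rate)

def ties_merge(deltas: list[torch.Tensor]) -> torch.Tensor:
    """TIES step: elect a per-parameter sign, keep agreeing deltas, average them."""
    stacked = torch.stack(deltas)                    # shape: (n_models, ...)
    elected_sign = torch.sign(stacked.sum(dim=0))    # sign with the larger total magnitude
    agrees = torch.sign(stacked) == elected_sign     # which deltas agree with the elected sign
    kept = stacked * agrees
    counts = agrees.sum(dim=0).clamp(min=1)          # avoid division by zero
    return kept.sum(dim=0) / counts                  # disjoint mean over agreeing models

def dare_ties(base: dict, finetuned: list[dict], drop_rate: float = 0.5, weight: float = 1.0) -> dict:
    """Merge several fine-tuned checkpoints back into the base model's parameter space."""
    merged = {}
    for name, base_param in base.items():
        deltas = [dare(ft[name] - base_param, drop_rate) for ft in finetuned]
        merged[name] = base_param + weight * ties_merge(deltas)
    return merged

if __name__ == "__main__":
    # Toy "checkpoints" with a single 4-element parameter, purely for illustration.
    base = {"w": torch.zeros(4)}
    ft_a = {"w": torch.tensor([0.4, -0.2, 0.1, 0.0])}
    ft_b = {"w": torch.tensor([0.3, 0.5, -0.1, 0.2])}
    print(dare_ties(base, [ft_a, ft_b], drop_rate=0.3)["w"])
```

In practice such merges are typically run with tooling like mergekit over full checkpoints rather than hand-rolled tensor code; the snippet only makes the drop-rescale and sign-election logic concrete.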