Frank Goldhammer


2025

Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments
Fabian Zehner | Hyo Jeong Shin | Emily Kerzabi | Andrea Horbach | Sebastian Gombert | Frank Goldhammer | Torsten Zesch | Nico Andersen
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

We present Omethi, a framework for scoring short text responses in a semi-automatic fashion, particularly suited to international large-scale assessments, and evaluate its effectiveness on the massively multilingual PISA tests. Responses are passed through a conditional flow of hierarchically combined scoring components until a score is assigned; once a component assigns a score, hierarchically lower components are discarded. The models implemented in this study range from lexical matching of normalized texts (excellent accuracy but weak generalizability) to fine-tuned large language models (lower accuracy but high generalizability). Responses not scored by any automatic component are passed on to manual scoring. The paper is the first to evaluate automatic scoring on multilingual PISA data in eleven languages (including Arabic, Finnish, Hebrew, and Kazakh) from three domains (_n_ = 3.8 million responses). On average, the results show a 71 percent reduction in manual effort, with an agreement of _κ_ = .957 when including manually scored responses and _κ_ = .804 for the automatically scored responses alone. The evaluation underscores the framework's adaptivity and operational feasibility: the shares of the individual components vary substantially across domains and languages while accuracy remains homogeneously high.
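
The cascading control flow described in the abstract (responses fall through hierarchically ordered components until one assigns a score, otherwise they are routed to manual scoring) can be illustrated with a minimal sketch. The component names, the scoring key, and the abstention behavior below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a hierarchical scoring cascade in the spirit of Omethi.
# Component names and the scoring key are illustrative assumptions.
from typing import Callable, Optional

# A scorer returns a score, or None if it abstains and defers to lower components.
Scorer = Callable[[str], Optional[int]]

def exact_lexical_match(response: str) -> Optional[int]:
    """Match the normalized response against a key of known correct answers (assumed key)."""
    key = {"paris", "the city of paris"}
    return 1 if response.strip().lower() in key else None

def llm_classifier(response: str) -> Optional[int]:
    """Placeholder for a fine-tuned language-model scorer; abstains in this sketch."""
    return None

def cascade_score(response: str, components: list[Scorer]) -> Optional[int]:
    """Pass the response down the cascade; the first component to assign a score wins.

    Returning None means no automatic component scored the response,
    so it would be routed to manual scoring.
    """
    for score in components:
        result = score(response)
        if result is not None:
            return result
    return None

score = cascade_score("Paris", [exact_lexical_match, llm_classifier])
print("manual scoring needed" if score is None else f"automatic score: {score}")
```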