Max Schellenberg


2025

We present an approach to the automated scoring of a German Written Elicited Imitation Test, designed to assess literacy-dependent procedural knowledge in German as a foreign language. In this test, each sentence is briefly displayed on a screen and, after a short pause, test-takers are asked to reproduce it in writing as accurately as possible. Responses are rated on a 5-point ordinal scale, with grammatical errors typically penalized more heavily than lexical deviations. We compare two models: a rule-based model that implements the categories of the scoring rubric through hand-crafted rules, and a deep learning model trained on pairs of stimulus sentences and written responses. Both models achieve promising performance, with quadratically weighted kappa (QWK) values around .87. However, their strengths differ: the rule-based model performs better on previously unseen stimulus sentences and at the extremes of the rating scale, while the deep learning model shows advantages in scoring mid-range responses, for which explicit rules are harder to define.
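
For readers unfamiliar with the agreement metric reported above, the following is a minimal sketch of how quadratically weighted kappa can be computed between human and automated ratings. It is an illustration only, not the evaluation code used in this work: the function name `quadratic_weighted_kappa`, the parameter names, and the coding of the 5-point scale as the integers 0 to 4 are all assumptions made for the example.

```python
def quadratic_weighted_kappa(human, model, min_score=0, max_score=4):
    """Quadratically weighted kappa between two lists of integer ratings.

    Assumption for illustration: the 5-point scale is coded 0..4.
    """
    if len(human) != len(model) or not human:
        raise ValueError("rating lists must be non-empty and of equal length")

    k = max_score - min_score + 1  # number of scale points
    n = len(human)                 # number of rated responses

    # Observed confusion matrix: rows = human score, columns = model score.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal score distributions of each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under chance agreement (independent raters).
            weighted_expected += weight * hist_h[i] * hist_m[j] / n

    if weighted_expected == 0:
        # Both raters assigned the same single score to every response.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Because the penalty is quadratic in the score distance, a model that is off by one scale point is penalized far less than one that confuses the extremes of the scale, which suits an ordinal rubric of this kind.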