Malte Sternik


2026

This paper explores the use of a Large Language Model (LLM) with k-shot prompting at large k to automatically score a German Written Elicited Imitation Test (WEIT), a test that assesses literacy-dependent procedural knowledge in German as a foreign language. In this test, test-takers are briefly shown written sentences, which they must then reproduce in writing as accurately as possible. The responses are scored on an ordinal scale that distinguishes between types of errors (e.g. lexical vs. grammatical). We find that accuracy increases significantly with increasing k (in a range from 1 to 700), but that it also depends on the drawn sample and varies across runs of the same prompt. Overall, the k-shot setting, which relies on in-context learning and is not given the scoring rubric, outperforms a baseline in which only the scoring rubric is provided to the model. However, the LLM does not outperform previous results obtained with rule-based or BERT-based models.
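As a rough illustration of the k-shot setting described above, the following minimal sketch assembles a prompt from k randomly drawn scored examples followed by the unscored target response. It is not the prompt format used in this study: the function name `build_k_shot_prompt`, the field names, the toy sentences, and the score values are all hypothetical, and the seed parameter is included only to show how the drawn sample can be fixed or varied between runs.

```python
import random

# Hypothetical toy pool of already-scored responses (not data from the study).
POOL = [
    {"stimulus": "Der Hund läuft schnell.", "response": "Der Hund läuft schnell.", "score": 3},
    {"stimulus": "Die Frau liest ein Buch.", "response": "Die Frau liest eine Buch.", "score": 2},
    {"stimulus": "Wir gehen morgen ins Kino.", "response": "Wir gehen heute ins Kino.", "score": 1},
    {"stimulus": "Das Kind spielt im Garten.", "response": "Kind Garten.", "score": 0},
]


def build_k_shot_prompt(pool, stimulus, response, k, seed=0):
    """Draw k scored examples and append the unscored target item."""
    rng = random.Random(seed)    # fixed seed makes the drawn sample reproducible
    shots = rng.sample(pool, k)  # sampling without replacement; requires k <= len(pool)
    lines = []
    for ex in shots:
        lines.append(f"Stimulus: {ex['stimulus']}")
        lines.append(f"Response: {ex['response']}")
        lines.append(f"Score: {ex['score']}")
        lines.append("")
    # The target item ends with an open score slot for the model to complete.
    lines.append(f"Stimulus: {stimulus}")
    lines.append(f"Response: {response}")
    lines.append("Score:")
    return "\n".join(lines)


prompt = build_k_shot_prompt(
    POOL, "Er trinkt gern Kaffee.", "Er trinkt gerne Kaffee.", k=2, seed=42
)
print(prompt)
```

Changing `seed` changes which examples are drawn, which is one simple way to probe the dependence on the drawn sample mentioned above; no scoring rubric appears anywhere in the prompt.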