The annotation of learner language is an often ambiguous and challenging task. It is therefore surprising that in Second Language Acquisition research, information on annotation quality is hardly ever published. This is also true for verb placement, a linguistic feature that has re- ceived much attention within SLA. This paper presents an annotation on verb placement in German learner texts at different proficiency levels. We argue that as part of the annotation process target hypotheses should be provided as ancillary annotations that make explicit each annotator’s interpretation of a learner sentence. Our study demonstrates that verb placement can be annotated with high agreement between multiple annotators, for texts at all proficiency levels and across sentences of varying complex- ity. We release our corpus with annotations by four annotators on more than 600 finite clauses sampled across 5 CEFR levels.
Developmental stages are a linguistic concept claiming that language learning, despite its large inter-individual variance, generally progresses in an ordered, step-like manner. At the core of research has been the acquisition of verb placement by learners, as conceptualized within Processability Theory (Pienemann, 1989). The computational implementation of a system detecting developmental stages is a prerequisite for an automated analysis of L2 language development. However, such an implementation faces two main challenges. The first is the lack of a fully fleshed out, coherent linguistic specification of the stages. The second concerns the translation of the linguistic specification into computational procedures that can extract clauses from learner-produced text and assign them to a developmental stage based on verb placement. Our contribution provides the necessary linguistic specification of the stages as well as detaiiled discussion and recommendations regarding computational implementation.
The MERLIN corpus is a written learner corpus for Czech, German,and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR) with authentic learner data. The corpus contains 2,290 learner texts produced in standardized language certifications covering CEFR levels A1-C1. The MERLIN annotation scheme includes a wide range of language characteristics that enable research into the empirical foundations of the CEFR scales and provide language teachers, test developers, and Second Language Acquisition researchers with concrete examples of learner performance and progress across multiple proficiency levels. For computational linguistics, it provide a range of authentic learner data for three target languages, supporting a broadening of the scope of research in areas such as automatic proficiency classification or native language identification. The annotated corpus and related information will be freely available as a corpus resource and through a freely accessible, didactically-oriented online platform.