DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis

Sarah Jablotschkin, Elke Teich, Heike Zinsmeister


Abstract
In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been requested to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step toward gaining insights into these issues and to further language technology (LT) development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and for comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. those concerning punctuation) and that text genre has a strong effect on the distinctiveness of the two language variants.
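The corpus comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes unigram models (the simplest n-gram case), add-one smoothing over a shared vocabulary, and two invented toy sentences that are not taken from DE-Lite. The function names `unigram_probs`, `kl_divergence` and `pointwise_contributions` are hypothetical.

```python
import math
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary.

    Smoothing (an assumption of this sketch) keeps every probability
    non-zero, so the divergence below is always finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """Relative entropy D(P || Q) in bits; p and q share the same keys."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def pointwise_contributions(p, q):
    """Per-item summands of D(P || Q), largest first.

    Items at the top are those most typical of P relative to Q.
    """
    terms = ((w, p[w] * math.log2(p[w] / q[w])) for w in p)
    return sorted(terms, key=lambda item: item[1], reverse=True)


# Invented toy data (not from DE-Lite), whitespace-tokenised.
easy = "wir helfen ihnen . sie können uns anrufen . wir sind für sie da .".split()
standard = ("bei fragen stehen wir ihnen gerne telefonisch zur verfügung , "
            "da wir für sie da sind .").split()

vocab = set(easy) | set(standard)
p_easy = unigram_probs(easy, vocab)
p_std = unigram_probs(standard, vocab)

kl = kl_divergence(p_easy, p_std)
top = pointwise_contributions(p_easy, p_std)[:3]

print(f"D(easy || standard) = {kl:.3f} bits")
for word, contribution in top:
    print(f"{word!r}: {contribution:.3f}")
```

Ranking the individual summands rather than reporting only the total is what makes the measure usable for feature discovery: the items with the largest contributions are the ones that best distinguish one variant from the other. Note that the divergence is asymmetric, so swapping the two arguments generally gives a different value and a different ranking.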
Anthology ID:
2024.ltedi-1.9
Volume:
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
Month:
March
Year:
2024
Address:
St. Julian's, Malta
Editors:
Bharathi Raja Chakravarthi, Bharathi B, Paul Buitelaar, Thenmozhi Durairaj, György Kovács, Miguel Ángel García Cumbreras
Venues:
LTEDI | WS
Publisher:
Association for Computational Linguistics
Pages:
106–117
URL:
https://aclanthology.org/2024.ltedi-1.9
Cite (ACL):
Sarah Jablotschkin, Elke Teich, and Heike Zinsmeister. 2024. DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis. In Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, pages 106–117, St. Julian's, Malta. Association for Computational Linguistics.
Cite (Informal):
DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis (Jablotschkin et al., LTEDI-WS 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/2024.ltedi-1.9.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-3/2024.ltedi-1.9.mp4