Reem Alatrash


2021

pdf
Regression Analysis of Lexical and Morpho-Syntactic Properties of Kiezdeutsch
Diego Frassinelli | Gabriella Lapesa | Reem Alatrash | Dominik Schlechtweg | Sabine Schulte im Walde
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

Kiezdeutsch is a variety of German predominantly spoken by teenagers from multi-ethnic urban neighborhoods in casual conversations with their peers. In recent years, the popularity of Kiezdeutsch has increased among young people, independently of their socio-economic origin, and has spread in social media, too. While previous studies have extensively investigated this language variety from a linguistic and qualitative perspective, not much has been done from a quantitative point of view. We perform the first large-scale data-driven analysis of the lexical and morpho-syntactic properties of Kiezdeutsch in comparison with standard German. At the level of results, we confirm predictions of previous qualitative analyses and integrate them with further observations on specific linguistic phenomena such as slang and self-centered speaker attitude. At the methodological level, we provide logistic regression as a framework to perform bottom-up feature selection in order to quantify differences across language varieties.

2020

pdf
CCOHA: Clean Corpus of Historical American English
Reem Alatrash | Dominik Schlechtweg | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of the Twelfth Language Resources and Evaluation Conference

Modelling language change is an increasingly important area of interest within the fields of sociolinguistics and historical linguistics. In recent years, there has been a growing number of publications whose main concern is studying changes that have occurred within the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies in English. This paper describes methods applied to the downloadable version of the COHA corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus CCOHA contains a larger number of cleaned word tokens which can offer better insights into language change and allow for a larger variety of tasks to be performed.