Doreen Osmelak
2026
Systematicity between Forms and Meanings across Languages Supports Efficient Communication
Doreen Osmelak | Yang Xu | Michael Hahn | Kate McCurdy
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Languages vary widely in how meanings map to word forms. These mappings have been found to support efficient communication; however, this theory does not account for systematic relations within word forms. We examine how a restricted set of grammatical meanings (e.g. person, number) are expressed on verbs and pronouns across typologically diverse languages. Consistent with prior work, we find that verb and pronoun forms are shaped by competing communicative pressures for simplicity (minimizing the inventory of grammatical distinctions) and accuracy (enabling recovery of intended meanings). Crucially, our proposed model uses a novel measure of complexity (inverse of simplicity) based on the learnability of meaning-to-form mappings. This innovation captures fine-grained regularities in linguistic form, allowing better discrimination between attested and unattested systems, and establishes a new connection from efficient communication theory to systematicity in natural language.
2023
The Denglisch Corpus of German-English Code-Switching
Doreen Osmelak | Shuly Wintner
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
When multilingual speakers engage in a conversation, they inevitably introduce code-switching (CS), i.e., the mixing of more than one language between and within utterances. CS is still an understudied phenomenon, especially in the written medium, and relatively few computational resources for studying it are available. We describe a corpus of German-English code-switching in social media interactions. We focus on some challenges in annotating CS, especially those due to words whose language ID cannot be easily determined. We introduce a novel schema for such word-level annotation, with which we manually annotated a subset of the corpus. We then trained classifiers to predict and identify switches, and applied them to the remainder of the corpus. Thereby, we created a large-scale corpus of German-English mixed utterances with precise indications of CS points.
Shared Lexical Items as Triggers of Code Switching
Shuly Wintner | Safaa Shehadi | Yuli Zeira | Doreen Osmelak | Yuval Nov
Transactions of the Association for Computational Linguistics, Volume 11
Why do bilingual speakers code-switch (mix their two languages)? Among the several theories that attempt to explain this natural and ubiquitous phenomenon, the triggering hypothesis relates code-switching to the presence of lexical triggers, specifically cognates and proper names, adjacent to the switch point. We provide a fuller, more nuanced and refined exploration of the triggering hypothesis, based on five large datasets in three language pairs, reflecting both spoken and written bilingual interactions. Our results show that words that are assumed to reside in a mental lexicon shared by both languages indeed trigger code-switching, and that the tendency to switch depends on the distance of the trigger from the switch point and on whether the trigger precedes or follows the switch, but not on the etymology of the trigger words. We thus provide strong, robust, evidence-based confirmation of several hypotheses on the relationships between lexical triggers and code-switching.