Detecting de minimis Code-Switching in Historical German Books

Shijia Liu, David Smith


Abstract
Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching with its appearance in more formal registers by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.
Anthology ID:
2020.coling-main.163
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
1808–1814
URL:
https://aclanthology.org/2020.coling-main.163
DOI:
10.18653/v1/2020.coling-main.163
Cite (ACL):
Shijia Liu and David Smith. 2020. Detecting de minimis Code-Switching in Historical German Books. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1808–1814, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Detecting de minimis Code-Switching in Historical German Books (Liu & Smith, COLING 2020)
PDF:
https://preview.aclanthology.org/update-css-js/2020.coling-main.163.pdf