DanTok: Domain Beats Language for Danish Social Media POS Tagging
Kia Kirstein Hansen, Maria Barrett, Max Müller-Eberstein, Cathrine Damgaard, Trine Eriksen, Rob van der Goot
Abstract
Language from social media remains challenging to process automatically, especially for non-English languages. In this work, we introduce the first NLP dataset for TikTok comments and the first Danish social media dataset with part-of-speech annotation. We further supply annotations for normalization, code-switching, and annotator uncertainty. As transferring models to such a highly specialized domain is non-trivial, we conduct an extensive study into which source data and modeling decisions most impact the performance. Surprisingly, transferring from in-domain data, even from a different language, outperforms in-language, out-of-domain training. These benefits nonetheless rely on the underlying language models having been at least partially pre-trained on data from the target language. Using our additional annotation layers, we further analyze how normalization, code-switching, and human uncertainty affect the tagging accuracy.
- Anthology ID:
- 2023.nodalida-1.27
- Volume:
- Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- May
- Year:
- 2023
- Address:
- Tórshavn, Faroe Islands
- Editors:
- Tanel Alumäe, Mark Fishel
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 271–279
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.nodalida-1.27/
- Cite (ACL):
- Kia Kirstein Hansen, Maria Barrett, Max Müller-Eberstein, Cathrine Damgaard, Trine Eriksen, and Rob van der Goot. 2023. DanTok: Domain Beats Language for Danish Social Media POS Tagging. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 271–279, Tórshavn, Faroe Islands. University of Tartu Library.
- Cite (Informal):
- DanTok: Domain Beats Language for Danish Social Media POS Tagging (Kirstein Hansen et al., NoDaLiDa 2023)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.nodalida-1.27.pdf