Abstract
This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.
- Anthology ID: 2023.vardial-1.3
- Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
- Month: May
- Year: 2023
- Address: Dubrovnik, Croatia
- Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
- Venue: VarDial
- Publisher: Association for Computational Linguistics
- Pages: 31–39
- URL: https://aclanthology.org/2023.vardial-1.3
- DOI: 10.18653/v1/2023.vardial-1.3
- Cite (ACL): Olli Kuparinen. 2023. Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 31–39, Dubrovnik, Croatia. Association for Computational Linguistics.
- Cite (Informal): Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets (Kuparinen, VarDial 2023)
- PDF: https://preview.aclanthology.org/proper-vol2-ingestion/2023.vardial-1.3.pdf