Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora
Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén
Abstract
This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.
- Anthology ID:
- 2020.vardial-1.16
- Volume:
- Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Venue:
- VarDial
- Publisher:
- International Committee on Computational Linguistics (ICCL)
- Pages:
- 173–185
- URL:
- https://aclanthology.org/2020.vardial-1.16
- Cite (ACL):
- Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, and Krister Lindén. 2020. Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 173–185, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
- Cite (Informal):
- Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora (Jauhiainen et al., VarDial 2020)
- PDF:
- https://preview.aclanthology.org/auto-file-uploads/2020.vardial-1.16.pdf