Toward Multilingual Identification of Online Registers

Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, Sampo Pyysalo


Abstract
We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE). Using CORE and the newly introduced corpus, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.
Anthology ID:
W19-6130
Volume:
Proceedings of the 22nd Nordic Conference on Computational Linguistics
Month:
September–October
Year:
2019
Address:
Turku, Finland
Venue:
NoDaLiDa
Publisher:
Linköping University Electronic Press
Pages:
292–297
URL:
https://aclanthology.org/W19-6130
Cite (ACL):
Veronika Laippala, Roosa Kyllönen, Jesse Egbert, Douglas Biber, and Sampo Pyysalo. 2019. Toward Multilingual Identification of Online Registers. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 292–297, Turku, Finland. Linköping University Electronic Press.
Cite (Informal):
Toward Multilingual Identification of Online Registers (Laippala et al., NoDaLiDa 2019)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W19-6130.pdf