Vlad Ștefănescu

2020

Applying Multilingual and Monolingual Transformer-Based Models for Dialect Identification
Cristian Popa | Vlad Ștefănescu
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We study the ability of large fine-tuned transformer models to solve a binary dialect identification task, with a special interest in comparing the performance of multilingual models to that of monolingual ones. The corpus contains Romanian and Moldavian samples from the news domain, as well as tweets used to assess performance. We find that the monolingual models outperform the multilingual ones and that the best results are obtained using an SVM ensemble of 5 different transformer-based models. We provide our experimental results and an analysis of the attention mechanisms of the best-performing individual classifiers to explain their decisions. The code we used was released under an open-source license.

Co-authors

Cristian Popa 1

Venues

vardial 1
