Vanilla Classifiers for Distinguishing between Similar Languages

Sergiu Nisioi, Alina Maria Ciobanu, Liviu P. Dinu


Abstract
In this paper we describe the submission of the UniBuc-NLP team for the Discriminating between Similar Languages Shared Task, DSL 2016. We present and analyze the results we obtained in the closed track of sub-task 1 (Similar languages and language varieties) and sub-task 2 (Arabic dialects). For sub-task 1 we used a logistic regression classifier with tf-idf feature weighting and for sub-task 2 a character-based string kernel with an SVM classifier. Our results show that good accuracy scores can be obtained with limited feature and model engineering. While certain limitations are to be acknowledged, our approach worked surprisingly well for out-of-domain, social media data, with 0.898 accuracy (3rd place) for dataset B1 and 0.838 accuracy (4th place) for dataset B2.
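As an illustration of the character-based string kernel mentioned for sub-task 2, the following is a minimal sketch and not the authors' implementation. The abstract does not say which string kernel was used, so this assumes a p-spectrum kernel (the dot product of character n-gram counts). The function names and the default n-gram length `n=3` are hypothetical choices made for the example.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Count every contiguous character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a, b, n=3):
    """Assumed p-spectrum kernel: dot product of the two n-gram count vectors."""
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    # Iterate over the smaller profile; n-grams missing from the other count as 0.
    if len(ca) > len(cb):
        ca, cb = cb, ca
    return sum(count * cb[gram] for gram, count in ca.items())


def normalized_kernel(a, b, n=3):
    """Scale the kernel to [0, 1] so that text length does not dominate."""
    denom = math.sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denom if denom else 0.0
```

A Gram matrix built from `normalized_kernel` over all training texts could then be passed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.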
Anthology ID:
W16-4830
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
235–242
URL:
https://aclanthology.org/W16-4830
Cite (ACL):
Sergiu Nisioi, Alina Maria Ciobanu, and Liviu P. Dinu. 2016. Vanilla Classifiers for Distinguishing between Similar Languages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 235–242, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Vanilla Classifiers for Distinguishing between Similar Languages (Nisioi et al., VarDial 2016)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/W16-4830.pdf