QCRI @ DSL 2016: Spoken Arabic Dialect Identification Using Textual Features

Mohamed Eldesouki, Fahim Dalvi, Hassan Sajjad, Kareem Darwish


Abstract
This paper describes the QCRI submissions to the task of automatic Arabic dialect classification into five Arabic variants, namely Egyptian, Gulf, Levantine, North-African, and Modern Standard Arabic (MSA). The training data is relatively small and is automatically generated by an ASR system. To avoid over-fitting on such a small dataset, we carefully selected and designed the features to capture the morphological essence of the different dialects. We submitted four runs to the Arabic sub-task. For all runs, we used a combined feature vector of character bi-grams, tri-grams, 4-grams, and 5-grams. We tried several machine-learning algorithms, namely Logistic Regression, Naive Bayes, Neural Networks, and Support Vector Machines (SVM) with linear and string kernels. However, all of our submitted runs used an SVM with a linear kernel. In the closed submission, we achieved the best accuracy of 0.5136 and the third-best weighted F1 score, with a difference of less than 0.002 from the highest score.
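As a rough illustration of the combined character n-gram feature vector mentioned in the abstract, the sketch below counts every character n-gram of length 2 through 5 in a transcript and maps the counts onto a fixed vocabulary. The function names (`char_ngrams`, `build_vocabulary`, `vectorize`) are hypothetical, and the use of raw counts is an assumption: the abstract does not say how the features were weighted, and this is not the authors' code. A linear-kernel SVM would then be trained on such vectors; that step is not shown here.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count all character n-grams of length n_min..n_max in text.

    Assumption: raw counts are used; the paper's weighting is not
    stated in the abstract.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Strings shorter than n yield an empty range and add nothing.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def build_vocabulary(texts):
    """Assign a column index to each n-gram seen in the training texts."""
    vocab = {}
    for text in texts:
        for gram in char_ngrams(text):
            vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Turn one text into a dense count vector over the vocabulary.

    N-grams that were not seen when building the vocabulary are ignored.
    """
    vec = [0] * len(vocab)
    for gram, count in char_ngrams(text).items():
        idx = vocab.get(gram)
        if idx is not None:
            vec[idx] = count
    return vec
```

Because the bi-gram through 5-gram counts share one vocabulary, each transcript ends up as a single vector combining all four n-gram orders, which is the shape of input a linear classifier expects.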
Anthology ID:
W16-4828
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
221–226
URL:
https://aclanthology.org/W16-4828
Cite (ACL):
Mohamed Eldesouki, Fahim Dalvi, Hassan Sajjad, and Kareem Darwish. 2016. QCRI @ DSL 2016: Spoken Arabic Dialect Identification Using Textual Features. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 221–226, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
QCRI @ DSL 2016: Spoken Arabic Dialect Identification Using Textual Features (Eldesouki et al., VarDial 2016)
PDF:
https://preview.aclanthology.org/ingest-bitext-workshop/W16-4828.pdf