MICHAEL: Mining Character-level Patterns for Arabic Dialect Identification (MADAR Challenge)

Dhaou Ghoul, Gaël Lejeune


Abstract
We present MICHAEL, a simple, lightweight method for automatic Arabic dialect identification on the MADAR travel-domain Dialect Identification (DID) task. MICHAEL uses simple character-level features to perform classification without any pre-processing. More precisely, character n-grams extracted from the original sentences are used to train a Multinomial Naive Bayes classifier. The system achieved an official score (accuracy) of 53.25% with 1 ≤ N ≤ 3, but obtained a much better result with character 4-grams (62.17% accuracy).
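As a rough illustration of the pipeline the abstract describes (raw sentences, character n-grams, Multinomial Naive Bayes), the following minimal sketch may help. It is not the authors' implementation: the names `char_ngrams` and `CharNgramNB`, the add-one smoothing value `alpha`, and the toy strings with labels "A" and "B" are all assumptions made for illustration and are not MADAR data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of `text` for n_min <= n <= n_max.

    The text is used as-is: no normalization, tokenization, or filtering.
    """
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts (illustrative)."""

    def __init__(self, n_min=1, n_max=3, alpha=1.0):
        self.n_min = n_min
        self.n_max = n_max
        self.alpha = alpha  # assumed Laplace smoothing, not taken from the paper

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n_min, self.n_max)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        # Total n-gram occurrences per class, used in the likelihood denominator.
        self.totals = {
            label: sum(self.feature_counts[label].values())
            for label in self.class_counts
        }
        return self

    def predict(self, sentence):
        # N-grams never seen in training are ignored.
        grams = [
            g for g in char_ngrams(sentence, self.n_min, self.n_max)
            if g in self.vocab
        ]
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for g in grams:
                score += math.log(
                    (self.feature_counts[label][g] + self.alpha) / denom
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data: two made-up "dialects" with disjoint alphabets.
train_sentences = ["aaab", "abaa", "zzzy", "yzzz"]
train_labels = ["A", "A", "B", "B"]

model = CharNgramNB(n_min=1, n_max=3).fit(train_sentences, train_labels)
print(model.predict("aab"))
print(model.predict("zzy"))
```

Under these assumptions, the 4-gram setting mentioned in the abstract would correspond to `CharNgramNB(n_min=4, n_max=4)`; the sketch makes no attempt to reproduce the reported accuracies.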
Anthology ID:
W19-4627
Volume:
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Wassim El-Hajj, Lamia Hadrich Belguith, Fethi Bougares, Walid Magdy, Imed Zitouni, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
229–233
URL:
https://aclanthology.org/W19-4627
DOI:
10.18653/v1/W19-4627
Cite (ACL):
Dhaou Ghoul and Gaël Lejeune. 2019. MICHAEL: Mining Character-level Patterns for Arabic Dialect Identification (MADAR Challenge). In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 229–233, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
MICHAEL: Mining Character-level Patterns for Arabic Dialect Identification (MADAR Challenge) (Ghoul & Lejeune, WANLP 2019)
PDF:
https://preview.aclanthology.org/nschneid-patch-5/W19-4627.pdf