Abstract
We present MICHAEL, a simple, lightweight method for automatic Arabic dialect identification, evaluated on the MADAR travel-domain Dialect Identification (DID) task. MICHAEL uses simple character-level features to classify sentences without any pre-processing. More precisely, character n-grams extracted from the original sentences are used to train a Multinomial Naive Bayes classifier. The system achieved an official score (accuracy) of 53.25% with 1 ≤ N ≤ 3, but performed much better with character 4-grams (62.17% accuracy).
- Anthology ID:
- W19-4627
- Volume:
- Proceedings of the Fourth Arabic Natural Language Processing Workshop
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Wassim El-Hajj, Lamia Hadrich Belguith, Fethi Bougares, Walid Magdy, Imed Zitouni, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
- Venue:
- WANLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 229–233
- URL:
- https://aclanthology.org/W19-4627
- DOI:
- 10.18653/v1/W19-4627
- Cite (ACL):
- Dhaou Ghoul and Gaël Lejeune. 2019. MICHAEL: Mining Character-level Patterns for Arabic Dialect Identification (MADAR Challenge). In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 229–233, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- MICHAEL: Mining Character-level Patterns for Arabic Dialect Identification (MADAR Challenge) (Ghoul & Lejeune, WANLP 2019)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/W19-4627.pdf
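
The approach described in the abstract (character n-grams extracted from raw sentences and fed to a Multinomial Naive Bayes classifier) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the helper `char_ngrams`, the class `CharNgramNB`, the smoothing value `alpha=1.0`, and the toy sentences and labels are all assumptions made for this example and are not taken from the paper or the MADAR corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of the raw text for n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-gram counts (hypothetical helper)."""

    def __init__(self, n_min=1, n_max=3, alpha=1.0):
        # alpha is the additive (Laplace) smoothing constant; 1.0 is an assumption.
        self.n_min, self.n_max, self.alpha = n_min, n_max, alpha

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)        # sentences per class
        self.ngram_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            grams = char_ngrams(sentence, self.n_min, self.n_max)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(cnt.values()) for c, cnt in self.ngram_counts.items()}
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence, self.n_min, self.n_max)
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for g in grams:
                if g in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((self.ngram_counts[label][g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data; the labels are placeholders, not MADAR classes.
train_sentences = ["شلونك اليوم", "إزيك النهارده", "كيفك اليوم"]
train_labels = ["dialect_a", "dialect_b", "dialect_c"]

# 1 <= N <= 3 mirrors the first configuration mentioned in the abstract;
# n_min=4, n_max=4 would correspond to the character 4-gram variant.
model = CharNgramNB(n_min=1, n_max=3).fit(train_sentences, train_labels)
print(model.predict("شلونك"))
```

No normalization or tokenization is applied before n-gram extraction, which matches the pre-processing-free spirit of the abstract; the accuracy figures reported there would of course depend on the real training data and the authors' actual setup.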