Abstract
This paper describes Faheem (adj. of understand), our submission to the NADI (Nuanced Arabic Dialect Identification) shared task. With so many Arabic dialects under-studied due to the scarcity of resources, the objective is to identify, at the country level, the Arabic dialect used in a tweet. We propose a machine learning approach in which we extract word-level n-gram (n = 1 to 3) and tf-idf features and feed them to six different classifiers. We train the system on a data set of 21,000 tweets, provided by the organizers, covering twenty-one Arab countries. Our top-performing classifiers are Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes.
- Anthology ID:
- 2020.wanlp-1.29
- Volume:
- Proceedings of the Fifth Arabic Natural Language Processing Workshop
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
- Venue:
- WANLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 282–287
- URL:
- https://aclanthology.org/2020.wanlp-1.29
- Cite (ACL):
- Nouf AlShenaifi and Aqil Azmi. 2020. Faheem at NADI shared task: Identifying the dialect of Arabic tweet. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 282–287, Barcelona, Spain (Online). Association for Computational Linguistics.
- Cite (Informal):
- Faheem at NADI shared task: Identifying the dialect of Arabic tweet (AlShenaifi & Azmi, WANLP 2020)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2020.wanlp-1.29.pdf
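
The feature step named in the abstract (word-level n-grams with n = 1 to 3, weighted by tf-idf) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the tokenizer, the tf-idf variant, or the library used, so whitespace tokenization, length-normalized term frequency, and an unsmoothed log(N/df) inverse document frequency are all assumptions here. The function names `word_ngrams` and `tfidf_vectors` are hypothetical, and the classifier stage is left out.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return word-level n-grams of length n_min..n_max.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(docs):
    """Map each document to a sparse dict {ngram: tf-idf weight}.

    Assumption: tf is the n-gram count divided by the number of
    n-grams in the document, and idf is log(N / df) with no smoothing.
    """
    counts = [Counter(word_ngrams(d)) for d in docs]

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            gram: (tf / total) * math.log(n_docs / df[gram])
            for gram, tf in c.items()
        })
    return vectors


# Toy placeholder documents, not data from the shared task.
toy_docs = ["a b", "a c"]
toy_vectors = tfidf_vectors(toy_docs)
```

In this sketch an n-gram that occurs in every document receives a weight of zero, which is the usual motivation for tf-idf: features shared by all classes carry no discriminative signal. The resulting sparse vectors would then be passed to a classifier such as those listed in the abstract.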