Weighted combination of BERT and N-GRAM features for Nuanced Arabic Dialect Identification

Abdellah El Mekki, Ahmed Alami, Hamza Alami, Ahmed Khoumsi, Ismail Berrada


Abstract
Across the Arab world, more than 300 million people speak different Arabic dialects, which are increasingly common in social media text. However, Arabic dialects are considered low-resource languages, which limits the development of machine-learning-based systems for them. In this paper, we investigate the Arabic dialect identification task from two perspectives: country-level dialect identification across 21 Arab countries, and province-level dialect identification across 100 provinces. We introduce a unified pipeline of state-of-the-art models that can handle both subtasks. Our experiments on the NADI shared task show promising results at both the country level (F1-score of 25.99%) and the province level (F1-score of 6.39%), ranking our system 2nd in the country-level subtask and 1st in the province-level subtask.
Anthology ID:
2020.wanlp-1.27
Volume:
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
268–274
URL:
https://aclanthology.org/2020.wanlp-1.27
Cite (ACL):
Abdellah El Mekki, Ahmed Alami, Hamza Alami, Ahmed Khoumsi, and Ismail Berrada. 2020. Weighted combination of BERT and N-GRAM features for Nuanced Arabic Dialect Identification. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 268–274, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Weighted combination of BERT and N-GRAM features for Nuanced Arabic Dialect Identification (El Mekki et al., WANLP 2020)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2020.wanlp-1.27.pdf