Abstract
We present the Arabic dialect identification system that we used for the country-level subtask of the NADI challenge. Our model consists of three components: BiLSTM-CNN, character-level TF-IDF, and topic modeling features. We represent each tweet using these features and feed them into a deep neural network. We then add an effective heuristic that improves the overall performance. We achieved an F1-Macro score of 20.77% and an accuracy of 34.32% on the test set. The model was also evaluated on the Arabic Online Commentary dataset, achieving results better than the state-of-the-art.
- Anthology ID:
- 2020.wanlp-1.31
- Volume:
- Proceedings of the Fifth Arabic Natural Language Processing Workshop
- Month:
- December
- Year:
- 2020
- Address:
- Barcelona, Spain (Online)
- Editors:
- Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
- Venue:
- WANLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 295–301
- URL:
- https://aclanthology.org/2020.wanlp-1.31
- Cite (ACL):
- Abdulrahman Aloraini, Massimo Poesio, and Ayman Alhelbawy. 2020. The QMUL/HRBDT contribution to the NADI Arabic Dialect Identification Shared Task. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 295–301, Barcelona, Spain (Online). Association for Computational Linguistics.
- Cite (Informal):
- The QMUL/HRBDT contribution to the NADI Arabic Dialect Identification Shared Task (Aloraini et al., WANLP 2020)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2020.wanlp-1.31.pdf
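The abstract reports two metrics, F1-Macro and accuracy, and the gap between them is typical of classification over many unevenly sized classes. The sketch below shows how the two are usually computed and why macro-averaged F1 can sit well below accuracy. It is an illustration only, not the shared task's official scorer: the function names, the country-code labels, and the toy predictions are all assumptions made up for this example, and macro-F1 is averaged here over the labels present in the gold data, which may differ from other scorers.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the labels that occur in gold."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in sorted(set(gold)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels (not NADI data): one frequent class and two rare ones.
gold = ["EG"] * 6 + ["SA"] * 2 + ["MA"] * 2
pred = ["EG"] * 6 + ["EG"] * 2 + ["EG", "MA"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

In this toy case the classifier favors the frequent class, so accuracy stays fairly high while the rare classes, which count equally in the macro average, pull F1-Macro down.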