Arabic dialect identification: An Arabic-BERT model with data augmentation and ensembling strategy

Kamel Gaanoun; Imade Benelallam

Abstract

This paper presents the ArabicProcessors team’s deep learning system designed for the NADI 2020 Subtask 1 (country-level dialect identification) and Subtask 2 (province-level dialect identification). We used Arabic-BERT in combination with data augmentation and ensembling methods. The unlabeled data provided by the task organizers (10 million tweets) was split into multiple subparts, to which we applied a semi-supervised learning method; we then ran a specific ensembling process on the resulting models. This system ranked 3rd in Subtask 1 with an F1-score of 23.26% and 2nd in Subtask 2 with an F1-score of 5.75%.
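
The abstract does not say which semi-supervised method or ensembling rule was used, so the following is only a generic sketch of the kind of pipeline it describes: splitting an unlabeled pool into subparts, keeping confident pseudo-labels, and combining several models by averaging their class probabilities. The function names, the 0.9 confidence threshold, and the use of soft voting are all assumptions made for illustration, not details taken from the paper.

```python
def split_into_parts(items, n_parts):
    """Split a list into n_parts contiguous chunks of nearly equal size."""
    size, rem = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `rem` chunks each take one extra item.
        end = start + size + (1 if i < rem else 0)
        parts.append(items[start:end])
        start = end
    return parts


def select_pseudo_labels(probs, threshold=0.9):
    """Keep (row index, predicted label) pairs whose top probability
    reaches the threshold. The threshold value is a hypothetical choice."""
    picked = []
    for i, row in enumerate(probs):
        label = max(range(len(row)), key=row.__getitem__)
        if row[label] >= threshold:
            picked.append((i, label))
    return picked


def soft_vote(prob_sets):
    """Average class probabilities across models (soft voting, assumed here)
    and return the highest-scoring class index for each example."""
    n_models = len(prob_sets)
    n_examples = len(prob_sets[0])
    n_classes = len(prob_sets[0][0])
    preds = []
    for i in range(n_examples):
        avg = [sum(p[i][c] for p in prob_sets) / n_models
               for c in range(n_classes)]
        preds.append(max(range(n_classes), key=avg.__getitem__))
    return preds
```

In such a setup, each subpart of the unlabeled tweets would be pseudo-labeled by a model trained on the labeled data, one model would be fine-tuned per augmented subset, and `soft_vote` would merge their predictions on the test set.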

Anthology ID: 2020.wanlp-1.28
Volume: Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Editors: Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue: WANLP
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 275–281
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.wanlp-1.28/
Cite (ACL): Kamel Gaanoun and Imade Benelallam. 2020. Arabic dialect identification: An Arabic-BERT model with data augmentation and ensembling strategy. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 275–281, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal): Arabic dialect identification: An Arabic-BERT model with data augmentation and ensembling strategy (Gaanoun & Benelallam, WANLP 2020)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/2020.wanlp-1.28.pdf