On The Arabic Dialects’ Identification: Overcoming Challenges of Geographical Similarities Between Arabic dialects and Imbalanced Datasets

Salma Jamal; Aly M. Kassem; Omar Mohamed; Ali Ashraf

doi:10.18653/v1/2022.wanlp-1.49

Abstract

Arabic is one of the world’s richest languages, with a diverse range of dialects that vary by geographical origin. In this paper, we present our solution to Subtask 1 (country-level dialect identification) of the Nuanced Arabic Dialect Identification (NADI) 2022 shared task, which placed third with an average macro F1 score of 26.44% across the two test sets. In the preprocessing stage, we removed the most frequent terms from all sentences across all dialects; in the modeling stage, we employed a hybrid loss function that combines weighted cross-entropy loss and Vector Scaling (VS) loss. On test sets A and B, our model achieved macro F1 scores of 35.68% and 17.192%, respectively.
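The abstract names the two loss terms but not how they are combined, so the following is only a minimal sketch of what such a hybrid loss might look like for a single example. It assumes a convex mix controlled by a hypothetical coefficient `alpha`, inverse-frequency class weights for the weighted cross-entropy term, and the common VS parameterization with a multiplicative factor `(n_c / n_max) ** gamma` and an additive term `tau * log(n_c / N)`. The function names and the default values of `alpha`, `gamma`, and `tau` are illustrative assumptions, not values taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def weighted_ce(logits, target, class_weights):
    """Cross-entropy for one example, scaled by the weight of the true class."""
    return -class_weights[target] * log_softmax(logits)[target]


def vs_loss(logits, target, class_counts, gamma=0.3, tau=1.0):
    """Vector Scaling loss for one example.

    Each logit z_c is rescaled by (n_c / n_max) ** gamma and shifted by
    tau * log(n_c / N) before the softmax (gamma and tau are assumed defaults).
    """
    n_max = max(class_counts)
    total = sum(class_counts)
    adjusted = [
        (n / n_max) ** gamma * z + tau * math.log(n / total)
        for z, n in zip(logits, class_counts)
    ]
    return -log_softmax(adjusted)[target]


def hybrid_loss(logits, target, class_counts, alpha=0.5, gamma=0.3, tau=1.0):
    """Assumed combination: alpha * weighted CE + (1 - alpha) * VS loss."""
    total = sum(class_counts)
    k = len(class_counts)
    # Inverse-frequency weights; equal to 1.0 for every class when balanced.
    weights = [total / (k * n) for n in class_counts]
    ce = weighted_ce(logits, target, weights)
    vs = vs_loss(logits, target, class_counts, gamma, tau)
    return alpha * ce + (1.0 - alpha) * vs
```

With balanced class counts, both terms reduce to plain cross-entropy. With imbalanced counts, an uninformative prediction is penalized more heavily when the true class is rare, which is the behaviour one would want on a skewed dialect distribution.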

Anthology ID: 2022.wanlp-1.49
Volume: Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates (Hybrid)
Editors: Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Venue: WANLP
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 458–463
URL: https://preview.aclanthology.org/ingest-emnlp/2022.wanlp-1.49/
DOI: 10.18653/v1/2022.wanlp-1.49
Cite (ACL): Salma Jamal, Aly M. Kassem, Omar Mohamed, and Ali Ashraf. 2022. On The Arabic Dialects’ Identification: Overcoming Challenges of Geographical Similarities Between Arabic dialects and Imbalanced Datasets. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 458–463, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal): On The Arabic Dialects’ Identification: Overcoming Challenges of Geographical Similarities Between Arabic dialects and Imbalanced Datasets (Jamal et al., WANLP 2022)
PDF: https://preview.aclanthology.org/ingest-emnlp/2022.wanlp-1.49.pdf