Optimizing Naive Bayes for Arabic Dialect Identification

Tommi Jauhiainen, Heidi Jauhiainen, Krister Lindén


Abstract
This article describes the language identification system used by the SUKI team in the 2022 Nuanced Arabic Dialect Identification (NADI) shared task. In addition to the system description, we give some details of the dialect identification experiments we conducted while preparing our submissions. In the end, we submitted only one official run. We used a Naive Bayes-based language identifier with character n-grams of lengths one to four, for which we implemented a new version that automatically optimizes its parameters. We also experimented with clustering the training data according to different topics. With macro F1 scores of 0.1963 on test set A and 0.1058 on test set B, we placed 18th out of the 19 competing teams.
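To make the approach named in the abstract concrete, the following is a minimal sketch of a Naive Bayes classifier over character n-grams of lengths one to four. It is not the SUKI team's implementation: the function names (`char_ngrams`, `train`, `classify`), the additive smoothing constant `alpha`, the uniform class prior, and the toy training data are all assumptions made for illustration, and the automatic parameter optimization described in the paper is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=4):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def train(samples, n_min=1, n_max=4):
    """Count n-grams per label. `samples` is an iterable of (text, label)."""
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(char_ngrams(text, n_min, n_max))
    return counts


def classify(text, counts, alpha=1.0, n_min=1, n_max=4):
    """Return the label with the highest smoothed log-likelihood.

    Assumes a uniform prior over labels and additive (Laplace-style)
    smoothing with constant `alpha`; both are illustrative choices.
    """
    vocab_size = len(set().union(*counts.values()))
    grams = char_ngrams(text, n_min, n_max)
    best_label, best_score = None, -math.inf
    for label, label_counts in counts.items():
        total = sum(label_counts.values())
        denom = total + alpha * vocab_size
        score = sum(math.log((label_counts[g] + alpha) / denom) for g in grams)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage with made-up data (hypothetical labels "A" and "B").
model = train([("aaaa aaaa", "A"), ("bbbb bbbb", "B")])
predicted = classify("aaa", model)
```

In a sketch like this, values such as `alpha` and the n-gram length range are the kind of settings one might tune on development data.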
Anthology ID:
2022.wanlp-1.40
Volume:
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
409–414
URL:
https://aclanthology.org/2022.wanlp-1.40
DOI:
10.18653/v1/2022.wanlp-1.40
Cite (ACL):
Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2022. Optimizing Naive Bayes for Arabic Dialect Identification. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 409–414, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Optimizing Naive Bayes for Arabic Dialect Identification (Jauhiainen et al., WANLP 2022)
PDF:
https://preview.aclanthology.org/landing_page/2022.wanlp-1.40.pdf