Active Learning for Multidialectal Arabic POS Tagging

Diyam Akra; Mohammed Khalilia; Mustafa Jarrar

doi:10.18653/v1/2025.findings-emnlp.1359

Abstract

Multidialectal Arabic POS tagging is challenging due to the morphological richness of Arabic and the high variability among its dialects. While POS tagging for MSA has advanced thanks to the availability of annotated datasets, creating similar resources for dialects remains costly and labor-intensive. Moreover, increasing the size of annotated datasets does not necessarily result in better performance. Active learning offers a more efficient alternative by prioritizing the annotation of the most informative samples. This paper proposes an active learning approach for multidialectal Arabic POS tagging. Our experiments show that annotating approximately 15,000 tokens is sufficient for high performance. We further demonstrate that using a model fine-tuned on one dialect to guide the selection of initial samples from another dialect accelerates convergence, reducing the annotation requirement by about 2,000 tokens. In conclusion, we propose an active learning pipeline and demonstrate that, upon reaching its defined stopping point of 16,000 annotated tokens, it achieves an accuracy of 97.6% on the Emirati Corpus.
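
As a rough illustration of what "prioritizing the most informative samples" can look like in practice, the sketch below ranks unlabeled sentences by the mean entropy of their predicted tag distributions and selects the most uncertain ones until a token budget is filled. This is an assumption made for illustration only: the abstract does not state which selection criterion the authors use, and the names `token_entropy`, `sentence_uncertainty`, `select_batch`, and the toy `pool` are hypothetical. The per-token probabilities could come from any tagger, including one fine-tuned on a different dialect, which is one possible reading of the cross-dialect initial selection described above.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predicted tag distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_uncertainty(token_probs: Sequence[Sequence[float]]) -> float:
    """Mean token entropy; averaging avoids favouring long sentences."""
    if not token_probs:
        return 0.0
    return sum(token_entropy(p) for p in token_probs) / len(token_probs)


def select_batch(pool: Dict[str, List[List[float]]], token_budget: int) -> List[str]:
    """Greedily pick the most uncertain sentences that fit in the token budget."""
    ranked = sorted(pool, key=lambda sid: sentence_uncertainty(pool[sid]), reverse=True)
    chosen: List[str] = []
    used = 0
    for sid in ranked:
        n_tokens = len(pool[sid])
        if used + n_tokens <= token_budget:
            chosen.append(sid)
            used += n_tokens
    return chosen


# Hypothetical pool: sentence id -> per-token probabilities over three tags.
pool = {
    "s1": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],                    # confident
    "s2": [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]],                    # very uncertain
    "s3": [[0.60, 0.30, 0.10], [0.90, 0.05, 0.05], [0.50, 0.50, 0.0]],  # mixed
}

batch = select_batch(pool, token_budget=5)
print(batch)
```

In a full loop, the selected sentences would be annotated, added to the training set, and the tagger retrained before the next round, stopping once a fixed budget (such as the 16,000-token stopping point mentioned above) is reached.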

Anthology ID: 2025.findings-emnlp.1359
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24960–24973
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1359/
DOI: 10.18653/v1/2025.findings-emnlp.1359
Cite (ACL): Diyam Akra, Mohammed Khalilia, and Mustafa Jarrar. 2025. Active Learning for Multidialectal Arabic POS Tagging. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24960–24973, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Active Learning for Multidialectal Arabic POS Tagging (Akra et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1359.pdf
Checklist: 2025.findings-emnlp.1359.checklist.pdf
