ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification

Kathrein Abu Kwaik, Motaz Saad


Abstract
In this paper, we present a dialect identification system (ArbDialectID) that competed in Task 1 of the MADAR shared task, MADAR Travel Domain Dialect Identification. We build a coarse and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We first build a coarse identification model that classifies each sentence into one of six dialects, then use this label as a feature for the fine-grained model, which classifies the sentence among 26 dialects from different Arab cities. After that, we apply an ensemble voting classifier on both sub-systems. Our system ranked 1st, achieving an F-score of 67.32%. Both the models and our feature engineering tools are made available to the research community.
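The pipeline described in the abstract (word- and character-level features, a coarse label fed to the fine-grained model as an extra feature, and voting across sub-systems) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the n-gram orders (word unigrams, character trigrams), the feature-key prefixes, and the example labels are all assumptions chosen for illustration, and the classifiers themselves are left out.

```python
from collections import Counter


def word_ngrams(text, n=1):
    """Return word n-grams of a whitespace-tokenized sentence."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return character n-grams (spaces included) of a sentence."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(text, coarse_label=None):
    """Count word- and character-level n-grams in one feature bag.

    If a coarse dialect label is supplied (hypothetical, e.g. produced by
    a first-stage classifier), it is added as one extra feature so that a
    fine-grained model can condition on it.
    """
    feats = Counter()
    for gram in word_ngrams(text, 1):   # assumed order: word unigrams
        feats["w:" + gram] += 1
    for gram in char_ngrams(text, 3):   # assumed order: character trigrams
        feats["c:" + gram] += 1
    if coarse_label is not None:
        feats["coarse=" + coarse_label] += 1
    return feats


def hard_vote(predictions):
    """Majority (hard) vote over labels predicted by several sub-systems."""
    return Counter(predictions).most_common(1)[0][0]
```

In use, each sub-system would predict a city-level label from features such as `extract_features(sentence, coarse_label)`, and `hard_vote` would combine those predictions into the final label. The abstract does not specify the voting scheme, so hard voting here is an assumption.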
Anthology ID:
W19-4632
Volume:
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
254–258
URL:
https://aclanthology.org/W19-4632
DOI:
10.18653/v1/W19-4632
Cite (ACL):
Kathrein Abu Kwaik and Motaz Saad. 2019. ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 254–258, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification (Abu Kwaik & Saad, WANLP 2019)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/W19-4632.pdf