Abstract
In this paper, we present a dialect identification system (ArbDialectID) that competed in Task 1 of the MADAR shared task, MADAR Travel Domain Dialect Identification. We build a coarse-grained and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We first build a coarse identification model that classifies each sentence into one of six dialects, then use this label as a feature for the fine-grained model, which classifies the sentence among 26 dialects from different Arab cities. Finally, we apply an ensemble voting classifier to both sub-systems. Our system ranked 1st, achieving an F-score of 67.32%. Both the models and our feature engineering tools are made available to the research community.
- Anthology ID:
- W19-4632
- Volume:
- Proceedings of the Fourth Arabic Natural Language Processing Workshop
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Venue:
- WANLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 254–258
- Language:
- URL:
- https://aclanthology.org/W19-4632
- DOI:
- 10.18653/v1/W19-4632
- Cite (ACL):
- Kathrein Abu Kwaik and Motaz Saad. 2019. ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 254–258, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification (Abu Kwaik & Saad, WANLP 2019)
- PDF:
- https://preview.aclanthology.org/paclic-22-ingestion/W19-4632.pdf
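
The coarse-to-fine pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not say which classifiers, n-gram orders, or voting scheme the authors used, so everything concrete here is an assumption: a small multinomial naive Bayes stands in for the real models, the coarse prediction is appended as a hypothetical `__coarse__=` feature token, soft voting averages the posteriors of the word-level and character-level sub-systems, and the training sentences and labels (`A`, `A1`, ...) are invented placeholders rather than MADAR data.

```python
import math
from collections import Counter, defaultdict

def word_ngrams(text, n=1):
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def char_ngrams(text, n=3):
    padded = " " + text + " "  # pad so word boundaries show up in the n-grams
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (a stand-in classifier)."""
    def __init__(self, featurize):
        self.featurize = featurize
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.featurize(text)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def log_scores(self, text):
        feats = self.featurize(text)
        n_docs, v = sum(self.docs.values()), len(self.vocab)
        scores = {}
        for label in self.docs:
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / n_docs)
            for f in feats:
                score += math.log((self.counts[label][f] + 1) / (total + v))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

def build_subsystem(featurize, texts, coarse_labels, fine_labels):
    """Train a coarse model, then a fine model that sees the coarse prediction."""
    coarse = NaiveBayes(featurize).fit(texts, coarse_labels)
    with_coarse = lambda t: featurize(t) + ["__coarse__=" + coarse.predict(t)]
    fine = NaiveBayes(with_coarse).fit(texts, fine_labels)
    return coarse, fine

def posterior(log_scores):
    """Turn log scores into normalized probabilities."""
    m = max(log_scores.values())
    exps = {y: math.exp(s - m) for y, s in log_scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def ensemble_predict(models, text):
    """Soft voting: average the posteriors of all sub-systems."""
    avg = Counter()
    for model in models:
        for label, p in posterior(model.log_scores(text)).items():
            avg[label] += p / len(models)
    return max(avg, key=avg.get)

# Invented placeholder data: (sentence, coarse label, fine label).
train = [
    ("aaa bbb ccc", "A", "A1"),
    ("aaa bbb ddd", "A", "A2"),
    ("xxx yyy zzz", "B", "B1"),
    ("xxx yyy www", "B", "B2"),
]
texts, coarse_y, fine_y = (list(col) for col in zip(*train))

word_coarse, word_fine = build_subsystem(word_ngrams, texts, coarse_y, fine_y)
char_coarse, char_fine = build_subsystem(char_ngrams, texts, coarse_y, fine_y)

print(ensemble_predict([word_fine, char_fine], "aaa bbb ccc"))
```

In this sketch the fine-grained model uses the predicted coarse label at both training and prediction time, so the feature it learns from matches what it will see on new text. Whether the original system used predicted or gold coarse labels during training is not stated in the abstract.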