2019
QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification
Younes Samih | Hamdy Mubarak | Ahmed Abdelali | Mohammed Attia | Mohamed Eldesouki | Kareem Darwish
Proceedings of the Fourth Arabic Natural Language Processing Workshop
This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Subtask 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural nets and heuristics. Since individual approaches suffer from various shortcomings, the combination of different approaches was able to fill some of these gaps. Our system achieves F1-Scores of 66.1% and 67.0% on the development sets for Subtasks 1 and 2 respectively.
A System for Diacritizing Four Varieties of Arabic
Hamdy Mubarak | Ahmed Abdelali | Kareem Darwish | Mohamed Eldesouki | Younes Samih | Hassan Sajjad
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations
Short vowels, a.k.a. diacritics, are often omitted when writing different varieties of Arabic, including Modern Standard Arabic (MSA), Classical Arabic (CA), and Dialectal Arabic (DA). However, diacritics are required to pronounce words properly, which makes diacritic restoration (a.k.a. diacritization) essential for language learning and text-to-speech applications. In this paper, we present a system for diacritizing MSA, CA, and two varieties of DA, namely Moroccan and Tunisian. The system uses a character-level sequence-to-sequence deep learning model that requires no feature engineering and beats all previous SOTA systems for all the Arabic varieties that we test on.
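The character-level framing mentioned in the abstract can be illustrated with a small data-preparation sketch. The abstract does not describe how training pairs are built, so everything here is an assumption made for illustration: the helper names `strip_diacritics` and `to_char_pairs` are hypothetical, and the diacritic set is limited to the Unicode range U+064B to U+0652 (fathatan through sukun).

```python
# Illustrative sketch only; not the authors' pipeline.
# Assumption: diacritics are the Arabic marks U+064B..U+0652.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(text):
    """Return the undiacritized form, i.e. a plausible model input."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_char_pairs(diacritized):
    """Pair each base character with the diacritics that follow it.

    A diacritic that appears before any base character is dropped.
    """
    pairs = []
    for ch in diacritized:
        if ch in DIACRITICS:
            if pairs:
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


# Hypothetical example word: kaf, ta, ba, each followed by a fatha.
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
source = strip_diacritics(word)
targets = to_char_pairs(word)
```

Here `source` is the bare character sequence a model would read, and `targets` lists, per character, the marks it would need to restore.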
2018
Multi-Dialect Arabic POS Tagging: A CRF Approach
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki | Younes Samih | Randah Alharbi | Mohammed Attia | Walid Magdy | Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Learning from Relatives: Unified Dialectal Arabic Segmentation
Younes Samih | Mohamed Eldesouki | Mohammed Attia | Kareem Darwish | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
Arabic dialects not only share a common koiné but also exhibit shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.
A Neural Architecture for Dialectal Arabic Segmentation
Younes Samih | Mohammed Attia | Mohamed Eldesouki | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer | Kareem Darwish
Proceedings of the Third Arabic Natural Language Processing Workshop
The automated processing of Arabic dialects is challenging due to the lack of spelling standards and the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a neural segmenter can be trained on only 350 annotated tweets, without any normalization or use of lexical features or lexical resources. We treat segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that rely on additional resources.
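Treating segmentation as character-level sequence labeling can be sketched as a conversion between a segmented word and one label per character. The abstract does not state the label inventory, so the B/M/E/S scheme below (segment beginning, middle, end, single-character segment), the `+` separator, and both function names are assumptions chosen for illustration.

```python
# Illustrative sketch only; the tag set and separator are assumptions.
def segmentation_to_labels(segmented, sep="+"):
    """Turn a segmented word such as 'w+ktb+hA' into (chars, labels)."""
    chars, labels = [], []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            if len(segment) == 1:
                label = "S"  # single-character segment
            elif i == 0:
                label = "B"  # first character of a segment
            elif i == len(segment) - 1:
                label = "E"  # last character of a segment
            else:
                label = "M"  # inside a segment
            chars.append(ch)
            labels.append(label)
    return chars, labels


def labels_to_segmentation(chars, labels, sep="+"):
    """Rebuild the segmented word from per-character labels."""
    pieces = []
    for ch, label in zip(chars, labels):
        pieces.append(ch)
        if label in ("E", "S"):
            pieces.append(sep)  # a segment closes after this character
    return "".join(pieces).rstrip(sep)


# Hypothetical transliterated example word with three segments.
chars, labels = segmentation_to_labels("w+ktb+hA")
restored = labels_to_segmentation(chars, labels)
```

A sequence labeler then only has to predict `labels` from `chars`; the second function turns its predictions back into a segmentation.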
Arabic POS Tagging: Don’t Abandon Feature Engineering Just Yet
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki
Proceedings of the Third Arabic Natural Language Processing Workshop
This paper compares Support Vector Machine based ranking (SVM-Rank) with Bidirectional Long Short-Term Memory (bi-LSTM) neural-network based sequence labeling for building a state-of-the-art Arabic part-of-speech tagging system. Using SVM-Rank leads to state-of-the-art results, but with a fair amount of feature engineering. Using bi-LSTM, particularly when combined with word embeddings, may lead to competitive POS-tagging results by automatically deducing latent linguistic features. However, we show that augmenting bi-LSTM sequence labeling with some of the features that we used for the SVM-Rank based tagger yields further improvements. We also show that the gains realized by using embeddings may not be additive with the gains achieved by the features. We are open-sourcing both the SVM-Rank and the bi-LSTM based systems.
2016
QCRI @ DSL 2016: Spoken Arabic Dialect Identification Using Textual Features
Mohamed Eldesouki | Fahim Dalvi | Hassan Sajjad | Kareem Darwish
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
The paper describes the QCRI submissions to the task of automatic Arabic dialect classification into 5 Arabic variants, namely Egyptian, Gulf, Levantine, North-African, and Modern Standard Arabic (MSA). The training data is relatively small and is automatically generated from an ASR system. To avoid over-fitting on such small data, we carefully selected and designed the features to capture the morphological essence of the different dialects. We submitted four runs to the Arabic sub-task. For all runs, we used a combined feature vector of character bi-grams, tri-grams, 4-grams, and 5-grams. We tried several machine-learning algorithms, namely Logistic Regression, Naive Bayes, Neural Networks, and Support Vector Machines (SVM) with linear and string kernels. However, our submitted runs used SVM with a linear kernel. In the closed submission, we achieved the best accuracy of 0.5136 and the third best weighted F1 score, with a difference of less than 0.002 from the highest score.
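The combined character n-gram feature vector described in the abstract can be sketched as a single count table over all n-grams of length 2 to 5. The abstract does not say how whitespace, word boundaries, or feature weighting were handled, so raw counts over the unmodified string are an assumption, and `char_ngrams` is a hypothetical helper name.

```python
# Illustrative sketch only; raw counts and no boundary handling are assumptions.
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


# Hypothetical transliterated input; bi-grams through 5-grams share one table.
features = char_ngrams("salam")
```

A sparse table like `features` could then be fed to a linear classifier such as the linear-kernel SVM named in the abstract.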