In this paper, we present our work for the NADI Shared Task (Abdul-Mageed and Habash, 2020): Nuanced Arabic Dialect Identification for Subtask-1: country-level dialect identification. We introduce a Reverse Translation Corpus Extension Systems (RTCES) to handle data imbalance along with reported results on several experimented approaches of word and document representations and different models architectures. The top scoring model was based on AraBERT (Antoun et al., 2020), with our modified extended corpus based on reverse translation of the given Arabic tweets. The selected system achieved a macro average F1 score of 20.34% on the test set, which places us as the 7th out of 18 teams in the final ranking Leaderboard.
In the last few years, deep learning has proved to be a very effective paradigm to discover patterns in large data sets. Unfortunately, deep learning training on small data sets is not the best option because most of the time traditional machine learning algorithms could get better scores. Now, we can train the neural network on a large data set then fine-tune on a smaller data set using the transfer learning technique. In this paper, we present our system for NADI shared Task: Country-level Dialect Identification, Our system is based on fine-tuning of BERT and it achieves 22.85 F1-score on Test Set and our rank is 5th out of 18 teams.
This paper describes the system we built for SemEval-2020 task 3. That is predicting the scores of similarity for a pair of words within two different contexts. Our system is based on both BERT embeddings and WordNet. We simply use cosine similarity to find the closest synset of the target words. Our results show that using this simple approach greatly improves the system behavior. Our model is ranked 3rd in subtask-2 for SemEval-2020 task 3.
Social media platforms, online news commenting spaces, and many other public forums have become widely known for issues of abusive behavior such as cyber-bullying and personal attacks. In this paper, we use the annotated tweets of the Offensive Language Identification Dataset (OLID) to train three levels of deep learning classifiers to solve the three sub-tasks associated with the dataset. Sub-task A is to determine if the tweet is toxic or not. Then, for offensive tweets, sub-task B requires determining whether the toxicity is targeted. Finally, for sub-task C, we predict the target of the offense; i.e. a group, individual, or other entity. In our solution, we tackle the problem of class imbalance in the dataset by using back translation for data augmentation and utilizing the fine-tuned BERT model in an ensemble of deep learning classifiers. We used this solution to participate in the three English sub-tasks of SemEval-2020 task 12. The proposed solution achieved 0.91393, 0.6300, and 0.57607 macro F1-average in sub-tasks A, B, and C respectively. We achieved the 9th, 14th, and 22nd places for sub-tasks A, B and C respectively.
Studies on Dialectical Arabic are growing more important by the day as it becomes the primary written and spoken form of Arabic online in informal settings. Among the important problems that should be explored is that of dialect identification. This paper reports different techniques that can be applied towards such goal and reports their performance on the Multi Arabic Dialect Applications and Resources (MADAR) Arabic Dialect Corpora. Our results show that improving on traditional systems using frequency based features and non deep learning classifiers is a challenging task. We propose different models based on different word and document representations. Our top model is able to achieve an F1 macro averaged score of 65.66 on MADAR’s small-scale parallel corpus of 25 dialects and Modern Standard Arabic (MSA).
The model submitted works as follows. When supplied a question and a passage it makes use of the BERT embedding along with the hierarchical attention model which consists of 2 parts, the co-attention and the self-attention, to locate a continuous span of the passage that is the answer to the question.
In this paper, we address the problem of finding a novel document descriptor based on the covariance matrix of the word vectors of a document. Our descriptor has a fixed length, which makes it easy to use in many supervised and unsupervised applications. We tested our novel descriptor in different tasks including supervised and unsupervised settings. Our evaluation shows that our document covariance descriptor fits different tasks with competitive performance against state-of-the-art methods.
In this paper we describe our QU-BIGIR system for the Arabic subtask D of the SemEval 2017 Task 3. Our approach builds on our participation in the past version of the same subtask. This year, our system uses different similarity measures that encodes lexical and semantic pairwise similarity of text pairs. In addition to well known similarity measures such as cosine similarity, we use other measures based on the summary statistics of word embedding representation for a given text. To rank a list of candidate question answer pairs for a given question, we learn a linear SVM classifier over our similarity features. Our best resulting run came second in subtask D with a very competitive performance to the first-ranking system.