Sabit Hassan


2021

pdf bib
ASAD: Arabic Social media Analytics and unDerstanding
Sabit Hassan | Hamdy Mubarak | Ahmed Abdelali | Kareem Darwish
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This system demonstration paper describes ASAD: Arabic Social media Analysis and unDerstanding, a suite of seven individual modules that allows users to determine dialects, sentiment, news category, offensiveness, hate speech, adult content, and spam in Arabic tweets. The suite is made available through a web API and a web interface where users can enter text or upload files.

pdf bib
QADI: Arabic Dialect Identification in the Wild
Ahmed Abdelali | Hamdy Mubarak | Younes Samih | Sabit Hassan | Kareem Darwish
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Proper dialect identification is important for a variety of Arabic NLP applications. In this paper, we present a method for rapidly constructing a tweet dataset containing a wide range of country-level Arabic dialects —covering 18 different countries in the Middle East and North Africa region. Our method relies on applying multiple filters to identify users who belong to different countries based on their account descriptions and to eliminate tweets that either write mainly in Modern Standard Arabic or mostly use vulgar language. The resultant dataset contains 540k tweets from 2,525 users who are evenly distributed across 18 Arab countries. Using intrinsic evaluation, we show that the labels of a set of randomly selected tweets are 91.5% accurate. For extrinsic evaluation, we are able to build effective country level dialect identification on tweets with a macro-averaged F1-score of 60.6% across 18 classes.

pdf bib
Adult Content Detection on Arabic Twitter: Analysis and Experiments
Hamdy Mubarak | Sabit Hassan | Ahmed Abdelali
Proceedings of the Sixth Arabic Natural Language Processing Workshop

With Twitter being one of the most popular social media platforms in the Arab region, it is not surprising to find accounts that post adult content in Arabic tweets; despite the fact that these platforms dissuade users from such content. In this paper, we present a dataset of Twitter accounts that post adult content. We perform an in-depth analysis of the nature of this data and contrast it with normal tweet content. Additionally, we present extensive experiments with traditional machine learning models, deep neural networks and contextual embeddings to identify such accounts. We show that from user information alone, we can identify such accounts with F1 score of 94.7% (macro average). With the addition of only one tweet as input, the F1 score rises to 96.8%.

pdf bib
UL2C: Mapping User Locations to Countries on Arabic Twitter
Hamdy Mubarak | Sabit Hassan
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Mapping user locations to countries can be useful for many applications such as dialect identification, author profiling, recommendation system, etc. Twitter allows users to declare their locations as free text, and these user-declared locations are often noisy and hard to decipher automatically. In this paper, we present the largest manually labeled dataset for mapping user locations on Arabic Twitter to their corresponding countries. We build effective machine learning models that can automate this mapping with significantly better efficiency compared to libraries such as geopy. We also show that our dataset is more effective than data extracted from GeoNames geographical database in this task as the latter covers only locations written in formal ways.

pdf bib
ArCorona: Analyzing Arabic Tweets in the Early Days of Coronavirus (COVID-19) Pandemic
Hamdy Mubarak | Sabit Hassan
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Over the past few months, there were huge numbers of circulating tweets and discussions about Coronavirus (COVID-19) in the Arab region. It is important for policy makers and many people to identify types of shared tweets to better understand public behavior, topics of interest, requests from governments, sources of tweets, etc. It is also crucial to prevent spreading of rumors and misinformation about the virus or bad cures. To this end, we present the largest manually annotated dataset of Arabic tweets related to COVID-19. We describe annotation guidelines, analyze our dataset and build effective machine learning and transformer based models for classification.

2020

pdf bib
ALT Submission for OSACT Shared Task on Offensive Language Detection
Sabit Hassan | Younes Samih | Hamdy Mubarak | Ahmed Abdelali | Ammar Rashed | Shammur Absar Chowdhury
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

In this paper, we describe our efforts at OSACT Shared Task on Offensive Language Detection. The shared task consists of two subtasks: offensive language detection (Subtask A) and hate speech detection (Subtask B). For offensive language detection, a system combination of Support Vector Machines (SVMs) and Deep Neural Networks (DNNs) achieved the best results on development set, which ranked 1st in the official results for Subtask A with F1-score of 90.51% on the test set. For hate speech detection, DNNs were less effective and a system combination of multiple SVMs with different parameters achieved the best results on development set, which ranked 4th in official results for Subtask B with F1-macro score of 80.63% on the test set.

pdf bib
ALT at SemEval-2020 Task 12: Arabic and English Offensive Language Identification in Social Media
Sabit Hassan | Younes Samih | Hamdy Mubarak | Ahmed Abdelali
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the systems submitted by the Arabic Language Technology group (ALT) at SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media. We focus on sub-task A (Offensive Language Identification) for two languages: Arabic and English. Our efforts for both languages achieved more than 90% macro-averaged F1-score on the official test set. For Arabic, the best results were obtained by a system combination of Support Vector Machine, Deep Neural Network, and fine-tuned Bidirectional Encoder Representations from Transformers (BERT). For English, the best results were obtained by fine-tuning BERT.

pdf bib
Constructing a Bilingual Corpus of Parallel Tweets
Hamdy Mubarak | Sabit Hassan | Ahmed Abdelali
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

In a bid to reach a larger and more diverse audience, Twitter users often post parallel tweets—tweets that contain the same content but are written in different languages. Parallel tweets can be an important resource for developing machine translation (MT) systems among other natural language processing (NLP) tasks. In this paper, we introduce a generic method for collecting parallel tweets. Using this method, we collect a bilingual corpus of English-Arabic parallel tweets and a list of Twitter accounts who post English-Arabictweets regularly. Since our method is generic, it can also be used for collecting parallel tweets that cover less-resourced languages such as Serbian and Urdu. Additionally, we annotate a subset of Twitter accounts with their countries of origin and topic of interest, which provides insights about the population who post parallel tweets. This latter information can also be useful for author profiling tasks.

2019

pdf bib
The MADAR Shared Task on Arabic Fine-Grained Dialect Identification
Houda Bouamor | Sabit Hassan | Nizar Habash
Proceedings of the Fourth Arabic Natural Language Processing Workshop

In this paper, we present the results and findings of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. This shared task was organized as part of The Fourth Arabic Natural Language Processing Workshop, collocated with ACL 2019. The shared task includes two subtasks: the MADAR Travel Domain Dialect Identification subtask (Subtask 1) and the MADAR Twitter User Dialect Identification subtask (Subtask 2). This shared task is the first to target a large set of dialect labels at the city and country levels. The data for the shared task was created or collected under the Multi-Arabic Dialect Applications and Resources (MADAR) project. A total of 21 teams from 15 countries participated in the shared task.