Mabrouka Bessghaier
2026
A Multi-Task Learning Framework for Modeling Engagement and Topic-Sensitive Responses in Arabic Women’s Discourse
Mabrouka Bessghaier | Md. Rafiul Biswas | Shimaa Ibrahim | Wajdi Zaghouani
Findings of the Association for Computational Linguistics: EACL 2026
Predicting how audiences react to Arabic social media posts requires reasoning beyond textual sentiment: reactions emerge from collective interpretation moderated by engagement dynamics and topical context. We present a multi-task learning (MTL) framework that jointly learns (i) audience reaction classification (Love, Haha, Angry, Sad, Care, Wow), (ii) engagement magnitude regression (six reactions, comments, shares), and (iii) non-engagement detection. On a corpus of 158k Arabic Facebook posts spanning women’s rights, gender debates, and economic empowerment, our model achieves a test macro-F1 of 72.4 and weighted-F1 of 89.1.
Audience Engagement with Arabic Women’s Social Empowerment and Wellbeing: A Decadal Corpus
Wajdi Zaghouani | Mabrouka Bessghaier | Md. Rafiul Biswas | Shimaa Amer Ibrahim
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper presents the Arabic Women and Society Corpus, a ten-year collection of 252,487 public Arabic Facebook posts related to women’s empowerment and social wellbeing. The corpus was collected from 51,660 pages across 77 countries between 2014 and 2024, resulting in more than 267 million user interactions. Each post includes engagement metrics such as shares, comments, and emotional reactions, providing a unique view of audience sentiment and social attention. The data were processed using an automated pipeline with language identification, normalization, and metadata cleaning to ensure reliability and reproducibility. The corpus enables large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects. It supports research in Arabic natural language processing, computational social science, and digital communication studies. The dataset and accompanying documentation will be released publicly for research use under an open license.
ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication
Wajdi Zaghouani | Md. Rafiul Biswas | Mabrouka Bessghaier | Shimaa Amer Ibrahim | George Mikros
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper understanding of how public engagement with climate issues develops across time, geography, and institutional contexts.
JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media
Wajdi Zaghouani | Shimaa Amer Ibrahim | Mabrouka Bessghaier | Houda Bouamor
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.
ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination
Wajdi Zaghouani | Shimaa Amer Ibrahim | Mabrouka Bessghaier | Houda Bouamor
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014–2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism, 100 discrimination) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.
From Posts to Pressure: An Arabic Dataset about Stress and Mental-Health Monitoring
Wajdi Zaghouani | Eman Sedqy Shlkamy | Mabrouka Bessghaier
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
How do Arabic-speaking communities express and engage with psychological stress on social media? We introduce AraStress, the first large-scale Arabic corpus dedicated to psychological stress research, comprising 175,862 public social media posts from 2020 to 2024, covering pandemic and post-pandemic periods. Unlike prior work, which focuses primarily on Twitter and on depression or suicidality, AraStress fills a significant gap in stress-focused Arabic mental-health NLP resources and enables large-scale analysis of stress-related expressions. Our lexicon-based analysis reveals that stress-related posts elicit predominantly affective engagement and exhibit a hybrid lexical framing that integrates religious and therapeutic language. AraStress provides a foundational resource for culturally grounded computational models of stress detection and digital wellbeing in Arabic-speaking communities.
2025
MarsadLab at NADI Shared Task: Arabic Dialect Identification and Speech Recognition using ECAPA-TDNN and Whisper
Md. Rafiul Biswas | Kais Attia | Shimaa Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
MarsadLab at TAQEEM 2025: Prompt-Aware Lexicon-Enhanced Transformer for Arabic Automated Essay Scoring
Mabrouka Bessghaier | Md. Rafiul Biswas | Amira Dhouib | Wajdi Zaghouani
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
MAHED Shared Task: Multimodal Detection of Hope and Hate Emotions in Arabic Content
Wajdi Zaghouani | Md. Rafiul Biswas | Mabrouka Bessghaier | Shimaa Ibrahim | George Mikros | Abul Hasnat | Firoj Alam
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
MarsadLab at AraHealthQA: Hybrid Contextual–Lexical Fusion with AraBERT for Question and Answer Categorization
Mabrouka Bessghaier | Shimaa Ibrahim | Md. Rafiul Biswas | Wajdi Zaghouani
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
Evaluation of Pretrained and Instruction-Based Pretrained Models for Emotion Detection in Arabic Social Media Text
Md. Rafiul Biswas | Shimaa Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
This study evaluates three approaches to emotion detection in Arabic social media text: instruction prompting of large language models (LLMs), instruction fine-tuning of LLMs, and transformer-based pretrained models. We compare pretrained transformer models such as AraBERT, CaMelBERT, and XLM-RoBERTa against instruction prompting of advanced LLMs such as GPT-4o, Gemini, Deepseek, and Fanar, and against instruction fine-tuning of LLMs such as Llama 3.1, Mistral, and Phi. On a heavily preprocessed dataset of 10,000 labeled Arabic tweets with overlapping emotional labels, transformer-based pretrained models outperform both instruction prompting and instruction fine-tuning. Instruction prompting draws efficiently on general linguistic skills but falls short in detecting subtle emotional contexts. Instruction fine-tuning is more specific but still trails pretrained transformer models. Our findings establish the need for optimized instruction-based approaches and underscore the important role of domain-specific transformer architectures in accurate Arabic emotion detection.
MarsadLab at BAREC Shared Task 2025: Strict-Track Readability Prediction with Specialized AraBERT Models on BAREC
Shimaa Ibrahim | Md. Rafiul Biswas | Mabrouka Bessghaier | Wajdi Zaghouani
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
Ahasis Shared Task: Hybrid Lexicon-Augmented AraBERT Model for Sentiment Detection in Arabic Dialects
Shimaa Amer Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects
This work was conducted as part of the Ahasis@RANLP–2025 shared task, which focuses on sentiment detection in Arabic dialects within the hotel review domain. The primary objective is to advance sentiment analysis methodologies tailored to dialectal Arabic. Our work combines data augmentation with a hybrid model that integrates AraBERT with a sentiment lexicon we created. Notably, the hybrid model significantly improved performance, reaching an F1-score of 0.74, compared to 0.56 when using AraBERT alone. These results highlight the effectiveness of lexicon integration and augmentation strategies in enhancing both the accuracy and robustness of sentiment classification in dialectal Arabic.
MarsadLab at AraGenEval Shared Task: LLM-Based Approaches to Arabic Authorship Style Transfer and Identification
Md. Rafiul Biswas | Mabrouka Bessghaier | Firoj Alam | Wajdi Zaghouani
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
2024
MARASTA: A Multi-dialectal Arabic Cross-domain Stance Corpus
Anis Charfi | Mabrouka Bessghaier | Andria Atalla | Raghda Akasheh | Sara Al-Emadi | Wajdi Zaghouani
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper introduces a cross-domain and multi-dialectal stance corpus for Arabic that includes four regions in the Arab World and covers the main Arabic dialect groups. Our corpus consists of 4,657 sentences, each manually annotated with its stance towards a specific topic. For each region, we collected sentences related to two controversial topics. Each sentence was annotated by at least two annotators to indicate whether its stance favors the topic, is against it, or is neutral. Our corpus is well-balanced with respect to dialect and stance. For each region, approximately half of the sentences are in Modern Standard Arabic (MSA) and the other half are in the region’s respective dialect. We conducted several machine-learning experiments for stance detection using our new corpus. Our most successful model is the Multi-Layer Perceptron (MLP), using Unigram or TF-IDF extracted features, which yielded an F1-score of 0.66 and an accuracy score of 0.66. Compared with the most similar state-of-the-art dataset, our dataset yielded better results in specific stance classes, particularly “neutral” and “against”.