Proceedings of The Second Arabic Natural Language Processing Conference

Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini (Editors)

Anthology ID: 2024.arabicnlp-1
Month: August
Year: 2024
Address: Bangkok, Thailand
Venues: ArabicNLP | WS
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2024.arabicnlp-1
PDF: https://preview.aclanthology.org/ingest-bitext-workshop/2024.arabicnlp-1.pdf


Wikidata as a Source of Demographic Information
Samir Abdaljalil | Hamdy Mubarak

Names carry important information about our identities and demographics such as gender, nationality, and ethnicity. We investigate the use of an individual’s name, in both Arabic and English, to predict important attributes, namely country, region, gender, and language. We extract data from Wikidata and normalize it to build a comprehensive dataset consisting of more than 1 million entities and their normalized attributes. We experiment with a Linear SVM approach, as well as two Transformers approaches consisting of BERT model fine-tuning and a Transformers pipeline. Our results indicate that we can predict the gender, language and region using the name only with a confidence over 0.65. The country attribute can be predicted with less accuracy. The Linear SVM approach outperforms the other approaches for all the attributes. The best performing approach was also evaluated on another dataset that consists of 1,500 names from 15 countries (covering different regions) extracted from Twitter, and yields similar results.

Synthetic Arabic Medical Dialogues Using Advanced Multi-Agent LLM Techniques
Mariam ALMutairi | Lulwah AlKulaib | Melike Aktas | Sara Alsalamah | Chang-Tien Lu

The increasing use of artificial intelligence in healthcare requires robust datasets for training and validation, particularly in the domain of medical conversations. However, the creation and accessibility of such datasets in Arabic face significant challenges, especially due to the sensitivity and privacy concerns that are associated with medical conversations. These conversations are rarely recorded or preserved, making comprehensive Arabic medical dialogue datasets scarce. This limitation not only slows down the development of effective natural language processing models but also restricts the opportunity for open comparison of algorithms and their outcomes. Recent advancements in large language models (LLMs) like ChatGPT, GPT-4, Gemini-pro, and Claude-3 show promising capabilities in generating synthetic data. To address this gap, we introduce a novel Multi-Agent LLM approach capable of generating synthetic Arabic medical dialogues from patient notes, regardless of the original language. This development presents a significant step towards overcoming the barriers in dataset availability, enhancing the potential for broader research and application in AI-driven medical dialogue systems.

AuRED: Enabling Arabic Rumor Verification using Evidence from Authorities over Twitter
Fatima Haouari | Tamer Elsayed | Reem Suwaileh

Diverging from the trend of the previous rumor verification studies, we introduce the new task of rumor verification using evidence that is exclusively captured from authorities, i.e., entities holding the right and knowledge to verify corresponding information. To enable research on this task for Arabic, a low-resource language, we construct and release the first Authority-Rumor-Evidence Dataset (AuRED). The dataset comprises 160 rumors expressed in tweets and 692 Twitter timelines of authorities containing about 34k tweets. Additionally, we explore how existing evidence retrieval and claim verification models for fact-checking perform on our task under both the cross-lingual zero-shot and in-domain fine-tuning setups. Our experiments show that although evidence retrieval models perform relatively well on the task, establishing strong baselines, there is still considerable room for improvement. However, existing claim verification models perform poorly on the task no matter how good the retrieval performance is. The results also show that stance detection can be useful for evidence retrieval. Moreover, existing fact-checking datasets showed potential in transfer learning to our task; however, further investigation using different datasets and setups is required.

Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available.

Strategies for Arabic Readability Modeling
Juan Liberato | Bashar Alhafni | Muhamed Khalil | Nizar Habash

Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic’s morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available.

AREEj: Arabic Relation Extraction with Evidence
Osama Mraikhat | Hadi Hamoud | Fadi Zaraket

Relational entity extraction is key in building knowledge graphs. A relational entity has a source, a tail and a type. In this paper, we consider Arabic text and introduce evidence enrichment which intuitively informs models for better predictions. Relational evidence is an expression in the text that explains how sources and targets relate. This paper augments the existing relational extraction dataset with evidence annotation to its 2.9-million Arabic relations. We leverage the augmented dataset to build AREEj, a relation extraction with evidence model from Arabic documents. The evidence augmentation model we constructed to complete the dataset achieved .82 F1-score (.93 precision, .73 recall). The target outperformed SOTA mREBEL with .72 F1-score (.78 precision, .66 recall).
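
As a quick consistency check on the scores reported above, F1 is the harmonic mean of precision and recall. The sketch below recomputes both reported F1 values from the reported precision and recall; the function and variable names are our own and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract.
evidence_f1 = f1_score(0.93, 0.73)  # evidence augmentation model
second_f1 = f1_score(0.78, 0.66)    # second set of reported scores

print(f"{evidence_f1:.3f} {second_f1:.3f}")
```

Both values agree with the reported .82 and .72 once rounded to two decimals.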

Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis
Sabri Boughorbel | Md Rizwan Parvez | Majd Hawasly

Training LLMs in low-resource languages usually utilizes machine translation (MT) data augmentation from English. However, translation brings a number of challenges: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, the quality of the data degrades, causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality stories generated by a capable LLM in Arabic, representing 1% of the original training data. We show, using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the translation pitfalls. We illustrate the improvement through case studies of linguistic and cultural bias issues.

A Context-Contrastive Inference Approach To Partial Diacritization
Muhammad ElNokrashy | Badr AlKhamissi

Diacritization plays a pivotal role in meaning disambiguation and in improving readability in Arabic texts. Efforts have long focused on marking every eligible character (Full Diacritization). Overlooked in comparison, Partial Diacritization (PD) is the selection of a subset of characters to be annotated to aid comprehension only where needed. Research has indicated that excessive diacritic marks can hinder skilled readers, reducing reading speed and accuracy. We conduct a behavioral experiment and show that partially marked text is often easier to read than fully marked text, and sometimes easier than plain text. In this light, we introduce Context-Contrastive Partial Diacritization (CCPD), a novel approach to PD which integrates seamlessly with existing Arabic diacritization systems. CCPD processes each word twice, once with context and once without, and diacritizes only the characters with disparities between the two inferences. Further, we introduce novel indicators for measuring partial diacritization quality to help establish this as a machine learning task. Lastly, we introduce TD2, a Transformer-variant of an established model which offers a markedly different performance profile on our proposed indicators compared to all other known systems.
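
The contrastive selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the two per-character prediction lists (with and without context) have already been produced by some diacritization system, and it assumes that the contextual prediction is the one emitted when the two disagree. The function name and the Latin placeholder characters are hypothetical.

```python
from typing import List


def contrastive_partial_diacritize(
    chars: List[str],
    with_context: List[str],
    without_context: List[str],
) -> str:
    """Keep a diacritic only where the two inferences disagree.

    chars:            base characters of one word
    with_context:     predicted diacritic per character, sentence visible
    without_context:  predicted diacritic per character, word in isolation
    """
    if not (len(chars) == len(with_context) == len(without_context)):
        raise ValueError("all three sequences must have the same length")

    out = []
    for ch, d_ctx, d_plain in zip(chars, with_context, without_context):
        out.append(ch)
        # Assumption: emit the contextual prediction on disagreement.
        if d_ctx != d_plain:
            out.append(d_ctx)
    return "".join(out)


# Toy example with Latin placeholders standing in for letters and marks.
word = contrastive_partial_diacritize(
    chars=["k", "t", "b"],
    with_context=["a", "a", "a"],
    without_context=["u", "a", "a"],
)
print(word)  # katb: only the first character is marked
```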

This paper introduces Arabic Contrastive Language-Image Pre-training (AraCLIP), a model designed for Arabic image retrieval tasks, building upon the Contrastive Language-Image Pre-training (CLIP) architecture. AraCLIP leverages Knowledge Distillation to transfer cross-modal knowledge from English to Arabic, enhancing its ability to understand Arabic text and retrieve relevant images. Unlike existing multilingual models, AraCLIP is uniquely positioned to understand the intricacies of the Arabic language, including specific terms, cultural nuances, and contextual constructs. By leveraging the CLIP architecture as our foundation, we introduce a novel approach that seamlessly integrates textual and visual modalities, enabling AraCLIP to effectively retrieve images based on Arabic textual queries. We offer an online demonstration allowing users to input Arabic prompts and compare AraCLIP’s performance with state-of-the-art multilingual models. We conduct comprehensive experiments to evaluate AraCLIP’s performance across diverse datasets, including Arabic XTD-11 and Arabic Flickr 8k. Our results showcase AraCLIP’s superiority in image retrieval accuracy, demonstrating its effectiveness in handling Arabic queries. AraCLIP represents a significant advancement in cross-lingual image retrieval, offering promising applications in Arabic language processing and beyond.

Large Language Models as Legal Translators of Arabic Legislatives: Does ChatGPT and Gemini Care for Context and Terminology?
Khadija ElFqih | Johanna Monti

Accurate translation of terminology and adaptation to in-context information is a pillar of high-quality translation. Recently, there has been remarkable interest in the use and evaluation of Large Language Models (LLMs), particularly for Machine Translation tasks. Nevertheless, despite their recent advancement and ability to understand and generate human-like language, these LLMs are still far from perfect, especially in domain-specific scenarios, and need to be thoroughly investigated. This is particularly evident in automatically translating legal terminology from Arabic into English and French, where, beyond the inherent complexities of legal language and specialised translations, technical limitations of LLMs further hinder accurate generation of text. In this paper, we present a preliminary evaluation of two evolving LLMs, namely GPT-4 (Generative Pre-trained Transformer) and Gemini, as legal translators of Arabic legislatives to test their accuracy and the extent to which they care for context and terminology across two language pairs (AR→EN / AR→FR). The study targets the evaluation of Zero-Shot prompting for in-context and out-of-context scenarios of both models, relying on a gold standard dataset verified by professional translators who are also experts in the field. We evaluate the results by applying the Multidimensional Quality Metrics to classify translation errors. Moreover, we also evaluate the general LLM outputs to verify their correctness, consistency, and completeness. In general, our results show that the models are far from perfect and call for more fine-tuning efforts using specialised terminological data in the legal domain from Arabic into English and French.

Towards Zero-Shot Text-To-Speech for Arabic Dialects
Khai Doan | Abdul Waheed | Muhammad Abdul-Mageed

Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, progress in other languages still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance, with models capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.

Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect
Salima Mdhaffar | Haroun Elleuch | Fethi Bougares | Yannick Estève

Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks, including Spoken Language Understanding (SLU) and Automatic Speech Recognition (ASR). For instance, fine-tuning SSL models for such tasks has shown significant potential, leading to improvements in the SOTA performance across challenging datasets. In contrast to existing research, this paper contributes by comparing the effectiveness of SSL approaches in the context of (i) the low-resource Spoken Tunisian Arabic Dialect and (ii) its combination with a low-resource SLU and ASR scenario, where only a few semantic annotations are available for fine-tuning. We conducted experiments using many SSL speech encoders on the TARIC-SLU dataset. We used speech encoders that were pre-trained on either monolingual or multilingual speech data. Some of them have also been refined, without in-domain or Tunisian data, through multimodal supervised teacher-student learning. The study yields numerous significant findings, which we discuss in the paper.

Arabic Automatic Story Generation with Large Language Models
Ahmed El-Shangiti | Fakhraddin Alwajih | Muhammad Abdul-Mageed

Large language models (LLMs) have recently emerged as a powerful tool for a wide range of language generation tasks. Nevertheless, this progress has been slower in Arabic. In this work, we focus on the task of generating stories from LLMs. For our training, we use stories acquired through machine translation (MT) as well as GPT-4. For the MT data, we develop a careful pipeline that ensures we acquire high-quality stories. For our GPT-4 data, we introduce crafted prompts that allow us to generate data well-suited to the Arabic context in both Modern Standard Arabic (MSA) and two Arabic dialects (Egyptian and Moroccan). For example, we generate stories tailored to various Arab countries on a wide host of topics. Our manual evaluation shows that our model fine-tuned on these training datasets can generate coherent stories that adhere to our instructions. We also conduct an extensive automatic and human evaluation comparing our models against state-of-the-art proprietary and open-source models. Our datasets and models will be made publicly available at https://github.com/UBC-NLP/arastories.

Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on Modern Standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13GB of text, which represents a fraction of the data used by existing models such as CAMeL, MARBERT, and ArBERT, compared to 7.8%, and 21.3%, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available at: https://github.com/amurtadha/Alclam.

Data Augmentation for Speech-Based Diacritic Restoration
Sara Shatnawi | Sawsan Alqahtani | Shady Shehata | Hanan Aldarmaki

This paper describes a data augmentation technique for boosting the performance of speech-based diacritic restoration. Our experiments demonstrate the utility of this approach, resulting in improved generalization of all models across different test sets. In addition, we describe the first multi-modal diacritic restoration model, utilizing both speech and text as input modalities. This type of model can be used to diacritize speech transcripts. Unlike previous work that relies on an external ASR model, the proposed model is far more compact and efficient. While the multi-modal framework does not surpass the ASR-based model for this task, it offers a promising approach for improving the efficiency of speech-based diacritization, with a potential for improvement using data augmentation and other methods.

Out-of-Domain Dependency Parsing for Dialects of Arabic: A Case Study
Noor Mokh | Daniel Dakota | Sandra Kübler

We study dependency parsing for four Arabic dialects (Gulf, Levantine, Egyptian, and Maghrebi). Since no syntactically annotated data exist for Arabic dialects, we train the parser on a Modern Standard Arabic (MSA) corpus, which creates an out-of-domain setting. We investigate methods to close the gap between the source (MSA) and target data (dialects), e.g., by training on syntactically similar sentences to the test data. For testing, we manually annotate a small data set from a dialectal corpus. We focus on parsing two linguistic phenomena, which are difficult to parse: Idafa and coordination. We find that we can improve results by adding in-domain MSA data while adding dialectal embeddings only results in minor improvements.

Investigating Linguistic Features for Arabic NLI
Yasmeen Bassas | Sandra Kübler

Native Language Identification (NLI) is concerned with predicting the native language of an author writing in a second language. We investigate NLI for Arabic, with a focus on the types of linguistic information given that Arabic is morphologically rich. We use the Arabic Learner Corpus (ALC) for training and testing along with a linear SVM. We explore lexical, morpho-syntactic, and syntactic features. Results show that the best single type of information is character n-grams ranging from 2 to 6. Using this model, we achieve an accuracy of 61.84%, thus outperforming previous results (Ionesco, 2015) by 11.74% even though we use an additional 2 L1s. However, when using prefix and suffix sequences, we reach an accuracy of 53.95%, showing that an approximation of unlexicalized features still reaches solid results.
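
To make the best-performing feature type concrete, the sketch below counts character n-grams of lengths 2 through 6 in a text. It is only an illustration under our own assumptions: the abstract does not specify preprocessing, weighting, or how the counts are vectorized before being passed to the linear SVM, and the function name is hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 6) -> Counter:
    """Count every character n-gram with n_min <= n <= n_max."""
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


features = char_ngrams("abcab")
print(features["ab"], features["abc"], len(features))
```

Counts like these would typically be turned into a sparse feature vector per essay before training a classifier.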

John vs. Ahmed: Debate-Induced Bias in Multilingual LLMs
Anastasiia Demidova | Hanin Atwany | Nour Rabih | Sanad Sha’ban | Muhammad Abdul-Mageed

Large language models (LLMs) play a crucial role in a wide range of real world applications. However, concerns about their safety and ethical implications are growing. While research on LLM safety is expanding, there is a noticeable gap in evaluating safety across multiple languages, especially in Arabic and Russian. We address this gap by exploring biases in LLMs across different languages and contexts, focusing on GPT-3.5 and Gemini. Through carefully designed argument-based prompts and scenarios in Arabic, English, and Russian, we examine biases in cultural, political, racial, religious, and gender domains. Our findings reveal biases in these domains. In particular, our investigation uncovers subtle biases where each model tends to present winners as those speaking the primary language the model is prompted with. Our study contributes to ongoing efforts to ensure justice and equality in LLM development and emphasizes the importance of further research towards responsible progress in this field.

Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition
Gagan Bhatia | El Moatez Billah Nagoudi | Fakhraddin Alwajih | Muhammad Abdul-Mageed

Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train Qalam on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, Qalam demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore Qalam’s potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.

The rapid advancements in Large Language Models (LLMs) have led to significant improvements in various natural language processing tasks. However, the evaluation of LLMs’ legal knowledge, particularly in non-English languages such as Arabic, remains under-explored. To address this gap, we introduce ArabLegalEval, a multitask benchmark dataset for assessing the Arabic legal knowledge of LLMs. Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions. In this work, we aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs. We explore the impact of in-context learning on performance and investigate various evaluation methods. Additionally, we explore workflows for automatically generating questions with automatic validation to enhance the dataset’s quality. By releasing ArabLegalEval and our code, we hope to accelerate AI research in the Arabic legal domain.

CATT: Character-based Arabic Tashkeel Transformer
Faris Alasmary | Orjuwan Zaafarani | Ahmad Ghannam

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we finetuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community.
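
The "relative DER" figures above are most naturally read as relative reductions in error rate with respect to a comparison system; that reading is our assumption, since the abstract does not define the term. Under that assumption, the computation can be sketched as below. The function name and the example error rates are hypothetical and are not results from the paper.

```python
def relative_error_reduction(baseline_der: float, model_der: float) -> float:
    """Percentage by which model_der improves on baseline_der."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - model_der) / baseline_der


# Hypothetical numbers for illustration only: a baseline DER of 10.0%
# and a model DER of 7.0% give a 30% relative reduction.
example = relative_error_reduction(baseline_der=10.0, model_der=7.0)
print(f"{example:.2f}%")  # 30.00%
```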

Picking Up Where the Linguist Left Off: Mapping Morphology to Phonology through Learning the Residuals
Salam Khalifa | Abdelrahim Qaddoumi | Ellen Broselow | Owen Rambow

Learning morphophonological mappings between the spoken form of a language and its underlying morphological structures is crucial for enriching resources for morphologically rich languages like Arabic. In this work, we focus on Egyptian Arabic as our case study and explore the integration of linguistic knowledge with a neural transformer model. Our approach involves learning to correct the residual errors from hand-crafted rules to predict the spoken form from a given underlying morphological representation. We demonstrate that using a minimal set of rules, we can effectively recover errors even in very low-resource settings.

On the Utility of Pretraining Language Models on Synthetic Data
Alcides Alcoba Inciarte | Sang Yun Kwon | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed

Development of pre-trained language models has predominantly relied on large amounts of data. However, this dependence on abundant data has limited the applicability of these models in low-resource settings. In this work, we investigate the utility of exploiting synthetic datasets acquired from different sources to pre-train language models for Arabic. Namely, we leverage data derived using four different methods: optical character recognition (OCR), automatic speech recognition (ASR), machine translation (MT), and generative language models. We use these datasets to pre-train models in three different architectures: encoder-only (BERT-Base), encoder-decoder (T5), and decoder-only (GPT-2). We test the capabilities of the resulting models on Arabic natural language understanding (NLU) tasks using the ORCA benchmark. Our results show that models trained on synthetic data can achieve performance comparable to, or even surpassing, that of models trained on gold data. For example, our model based on a GPT-2 architecture trained on a combined synthetic dataset surpasses the baseline model ARBERTv2. Overall, our models pre-trained on synthetic data demonstrate robust performance across various tasks. This highlights the potential of synthetic datasets in augmenting language model training in low-resource settings.

Benchmarking LLaMA-3 on Arabic Language Generation Tasks
Md Tawkat Islam Khondaker | Numaan Naeem | Fatimah Khan | AbdelRahim Elmadany | Muhammad Abdul-Mageed

Open-sourced large language models (LLMs) have exhibited remarkable performance in a variety of NLP tasks, often catching up with closed-source LLMs like ChatGPT. Among these open LLMs, LLaMA-3-70B has emerged as the most recent and the most prominent one. However, how LLaMA-3-70B would situate itself in multilingual settings, especially in a morphologically rich language like Arabic, has yet to be explored. In this work, we aim to bridge this gap by evaluating LLaMA-3-70B on a diverse set of Arabic natural language generation (NLG) benchmarks. To the best of our knowledge, this is the first study that comprehensively evaluates LLaMA-3-70B on tasks related to Arabic natural language generation. Our study reveals that LLaMA-3-70B lags behind closed LLMs like ChatGPT, both in Modern Standard Arabic (MSA) and dialectal Arabic (DA). We further compare the performance of LLaMA-3-70B with our smaller and dedicated finetuned Arabic models. We find that both LLaMA-3-70B and ChatGPT are outperformed by comparatively smaller dedicated Arabic models, indicating the scope for potential improvement with Arabic-focused LLMs.

From Nile Sands to Digital Hands: Machine Translation of Coptic Texts
Muhammed Saeed | Asim Mohamed | Mukhtar Mohamed | Shady Shehata | Muhammad Abdul-Mageed

The Coptic language, rooted in the historical landscapes of Egypt, continues to serve as a vital liturgical medium for the Coptic Orthodox and Catholic Churches across Egypt, North Sudan, Libya, and the United States, with approximately ten million speakers worldwide. However, the scarcity of digital resources in Coptic has resulted in its exclusion from digital systems, thereby limiting its accessibility and preservation in modern technological contexts. Our research addresses this issue by developing the most extensive parallel Coptic-centered corpus to date. This corpus comprises over 8,000 parallel sentences between Arabic and Coptic, and more than 24,000 parallel sentences between English and Coptic. We have also developed the first neural machine translation system between Coptic, English, and Arabic. Lastly, we evaluate the capability of leading proprietary Large Language Models (LLMs) to translate to and from Coptic using a few-shot learning approach (in-context learning). Our code and data are available at https://github.com/UBC-NLP/copticmt.

Event-Arguments Extraction Corpus and Modeling using BERT for Arabic
Alaa Aljabari | Lina Duaibes | Mustafa Jarrar | Mohammed Khalilia

Event-argument extraction is a challenging task, particularly in Arabic due to sparse linguistic resources. To fill this gap, we introduce the corpus (550k tokens) as an extension of Wojood, enriched with event-argument annotations. We used three types of event arguments: agent, location, and date, which we annotated as relation types. Our inter-annotator agreement evaluation resulted in 82.23% Kappa score and 87.2% F₁-score. Additionally, we propose a novel method for event relation extraction using BERT, in which we treat the task as text entailment. This method achieves an F₁-score of 94.01%. To further evaluate the generalization of our proposed method, we collected and annotated another out-of-domain corpus (about 80k tokens) called and used it as a second test set, on which our approach achieved promising results (83.59% F₁-score). Last but not least, we propose an end-to-end system for event-arguments extraction. This system is implemented as part of SinaTools, and both corpora are publicly available at https://sina.birzeit.edu/wojood

Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
Fakhraddin Alwajih | Gagan Bhatia | Muhammad Abdul-Mageed

Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.

Automated Essay Scoring (AES) has emerged as a significant research problem within natural language processing, providing valuable support for educators in assessing student writing skills. In this paper, we introduce QAES, the first publicly available trait-specific annotations for Arabic AES, built on the Qatari Corpus of Argumentative Writing (QCAW). QAES includes a diverse collection of essays in Arabic, each of them annotated with holistic and trait-specific scores, including relevance, organization, vocabulary, style, development, mechanics, and grammar. In total, it comprises 195 Arabic essays (with lengths ranging from 239 to 806 words) across two distinct argumentative writing tasks. We benchmark our dataset against the state-of-the-art English baselines and a feature-based approach. In addition, we discuss the adopted guidelines and the challenges encountered during the annotation process. Finally, we provide insights into potential areas for improvement and future directions in Arabic AES research.

Text classification is of paramount importance in a wide range of applications, including information retrieval, extraction and sentiment analysis. The challenge of classifying and labelling text genres, especially in web-based corpora, has received considerable attention. The frequent absence of unambiguous genre information complicates the identification of text types. To address these issues, the Functional Text Dimensions (FTD) method has been introduced to provide a universal set of categories for text classification. This study presents the Arabic Functional Text Dimensions Corpus (AFTD Corpus), a carefully curated collection of documents for evaluating text classification in Arabic. The AFTD Corpus, which we are making available to the community, consists of 3400 documents spanning 17 different class categories. Through a comprehensive evaluation using traditional machine learning and neural models, we assess the effectiveness of the FTD approach in the Arabic context. CAMeLBERT, a state-of-the-art model, achieved an impressive F1 score of 0.81 on our corpus. This research highlights the potential of the FTD method for improving text classification, especially for Arabic content, and underlines the importance of robust classification models in web applications.

This paper presents an overview of the Arabic Natural Language Understanding (ArabicNLU 2024) shared task, focusing on two subtasks: Word Sense Disambiguation (WSD) and Location Mention Disambiguation (LMD). The task aimed to evaluate the ability of automated systems to resolve word ambiguity and identify locations mentioned in Arabic text. We provided participants with novel datasets, including a sense-annotated corpus for WSD, called SALMA with approximately 34k annotated tokens, and the dataset with 3,893 annotations and 763 unique location mentions. These are challenging tasks. Out of the 38 registered teams, only three teams participated in the final evaluation phase, with the highest accuracy being 77.8% for WSD and 95.0% for LMD. The shared task not only facilitated the evaluation and comparison of different techniques, but also provided valuable insights and resources for the continued advancement of Arabic NLU technologies.

pdf abs
Pirates at ArabicNLU2024: Enhancing Arabic Word Sense Disambiguation using Transformer-Based Approaches
Tasneem Wael | Eman Elrefai | Mohamed Makram | Sahar Selim | Ghada Khoriba

This paper presents a novel approach to Arabic Word Sense Disambiguation (WSD) leveraging transformer-based models to tackle the complexities of the Arabic language. Utilizing the SALMA dataset, we applied several techniques, including Sentence Transformers with Siamese networks and the SetFit framework optimized for few-shot learning. Our experiments, structured around a robust evaluation framework, achieved a promising F1-score of up to 71%, securing second place in the ArabicNLU 2024: The First Arabic Natural Language Understanding Shared Task competition. These results demonstrate the efficacy of our approach, especially in dealing with the challenges posed by homophones, homographs, and the lack of diacritics in Arabic texts. The proposed methods significantly outperformed traditional WSD techniques, highlighting their potential to enhance the accuracy of Arabic natural language processing applications.

pdf abs
Upaya at ArabicNLU Shared-Task: Arabic Lexical Disambiguation using Large Language Models
Pawan Rajpoot | Ashvini Jindal | Ankur Parikh

Disambiguating a word’s intended meaning (sense) in a given context is important in Natural Language Understanding (NLU). WSD aims to determine the correct sense of ambiguous words in context. At the same time, LMD (a WSD variation) focuses on disambiguating location mention. Both tasks are vital in Natural Language Processing (NLP) and information retrieval, as they help correctly interpret and extract information from text. Arabic version is further challenging because of its morphological richness, encompassing a complex interplay of roots, stems, and affixes. This paper describes our solutions to both tasks, employing Llama3 and Cohere-based models under Zero-Shot Learning and Re-Ranking, respectively. Both the shared tasks were part of the second Arabic Natural Language Processing Conference co-located with ACL 2024. Overall, we achieved 1st rank in the WSD task (accuracy 78%) and 2nd rank in the LMD task (MRR@1 0.59).

pdf abs
rematchka at ArabicNLU2024: Evaluating Large Language Models for Arabic Word Sense and Location Sense Disambiguation
Reem Abdel-Salam

Natural Language Understanding (NLU) plays a vital role in Natural Language Processing (NLP) by facilitating semantic interactions. Arabic, with its diverse morphology, poses a challenge as it allows multiple interpretations of words, leading to potential misunderstandings and errors in NLP applications. In this paper, we present our approach for tackling Arabic NLU shared tasks for word sense disambiguation (WSD) and location mention disambiguation (LMD). Various approaches have been investigated from zero-shot inference of large language models (LLMs) to fine-tuning of pre-trained language models (PLMs). The best approach achieved 57% on WSD task ranking third place, while for the LMD task, our best systems achieved 94% MRR@1 ranking first place.

The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of a common 77 intents in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in the areas of machine translation and banking chat-bots. A total of 45 unique teams registered for this shared task, 11 of which actively participated in the test phase. Specifically, 11 teams participated in Subtask 1, while only 1 team participated in Subtask 2. The winning team of Subtask 1 achieved an F1 score of 0.8773, and the only team that submitted to Subtask 2 achieved a 1.667 BLEU score.

pdf abs
SMASH at AraFinNLP2024: Benchmarking Arabic BERT Models on the Intent Detection
Youssef Hariri | Ibrahim Abu Farha

The recent growth in Middle Eastern stock markets has intensified the demand for specialized financial Arabic NLP models to serve this sector. This article presents the participation of Team SMASH of The University of Edinburgh in the Multi-dialect Intent Detection task (Subtask 1) of the Arabic Financial NLP (AraFinNLP) Shared Task 2024. The dataset used in the shared task is the ArBanking77 (Jarrar et al., 2023). We tackled this task as a classification problem and utilized several BERT and BART-based models to classify the queries efficiently. Our solution is based on implementing a two-step hierarchical classification model based on MARBERTv2. We fine-tuned the model by using the original queries. Our team, SMASH, was ranked 9th with a macro F1 score of 0.7866, indicating areas for further refinement and potential enhancement of the model’s performance.

pdf abs
Fired_from_NLP at AraFinNLP 2024: Dual-Phase-BERT - A Fine-Tuned Transformer-Based Model for Multi-Dialect Intent Detection in The Financial Domain for The Arabic Language
Md. Chowdhury | Mostak Chowdhury | Anik Shanto | Hasan Murad | Udoy Das

In the financial industry, identifying user intent from text inputs is crucial for various tasks such as automated trading, sentiment analysis, and customer support. One important component of natural language processing (NLP) is intent detection, which is significant to the finance sector. Limited studies have been conducted in the field of finance using languages with limited resources like Arabic, despite notable works being done in high-resource languages like English. To advance Arabic NLP in the financial domain, the organizer of AraFinNLP 2024 has arranged a shared task for detecting banking intents from the queries in various Arabic dialects, introducing a novel dataset named ArBanking77 which includes a collection of banking queries categorized into 77 distinct intents classes. To accomplish this task, we have presented a hierarchical approach called Dual-Phase-BERT in which the detection of dialects is carried out first, followed by the detection of banking intents. Using the provided ArBanking77 dataset, we have trained and evaluated several conventional machine learning, and deep learning models along with some cutting-edge transformer-based models. Among these models, our proposed Dual-Phase-BERT model has ranked 7th out of all competitors, scoring 0.801 on the scale of F1-score on the test set.

pdf abs
AlexuNLP24 at AraFinNLP2024: Multi-Dialect Arabic Intent Detection with Contrastive Learning in Banking Domain
Hossam Elkordi | Ahmed Sakr | Marwan Torki | Nagwa El-Makky

Arabic banking intent detection represents a challenging problem across multiple dialects. It imposes generalization difficulties due to the scarcity of Arabic language and its dialects resources compared to English. We propose a methodology that leverages contrastive training to overcome this limitation. We also augmented the data with several dialects using a translation model. Our experiments demonstrate the ability of our approach in capturing linguistic nuances across different Arabic dialects as well as accurately differentiating between banking intents across diverse linguistic landscapes. This would enhance multi-dialect banking services in the Arab world with limited Arabic language resources. Using our proposed method we achieved second place on subtask 1 leaderboard of the AraFinNLP2024 shared task with micro-F1 score of 0.8762 on the test split.

Intention detection is a crucial aspect of natural language understanding (NLU), focusing on identifying the primary objective underlying user input. In this work, we present a transformer-based method that excels in determining the intent of Arabic text within the banking domain. We explored several machine learning (ML), deep learning (DL), and transformer-based models on an Arabic banking dataset for intent detection. Our findings underscore the challenges that traditional ML and DL models face in understanding the nuances of various Arabic dialects, leading to subpar performance in intent detection. However, the transformer-based methods, designed to tackle such complexities, significantly outperformed the other models in classifying intent across different Arabic dialects. Notably, the AraBERTv2 model achieved the highest micro F1 score of 82.08% on the ArBanking77 dataset, a testament to its effectiveness in this context. This achievement, which contributed to our work being ranked 5th in the shared task, AraFinNLP2024, highlights the importance of developing models that can effectively handle the intricacies of Arabic language processing and intent detection.

pdf abs
SENIT at AraFinNLP2024: trust your model or combine two
Abdelmomen Nasr | Moez Ben HajHmida

We describe our submitted system to the 2024 Shared Task on The Arabic Financial NLP (Malaysha et al., 2024). We tackled Subtask 1, namely Multi-dialect Intent Detection. We used state-of-the-art pretrained contextualized text representation models and fine-tuned them according to the downstream task at hand. We started by finetuning multilingual BERT and various Arabic variants, namely MARBERTV1, MARBERTV2, and CAMeLBERT. Then, we employed an ensembling technique to improve our classification performance combining MARBERTV2 and CAMeLBERT embeddings. The findings indicate that MARBERTV2 surpassed all the other models mentioned.

pdf abs
BabelBot at AraFinNLP2024: Fine-tuning T5 for Multi-dialect Intent Detection with Synthetic Data and Model Ensembling
Murhaf Fares | Samia Touileb

This paper presents our results for the Arabic Financial NLP (AraFinNLP) shared task at the Second Arabic Natural Language Processing Conference (ArabicNLP 2024). We participated in the first sub-task, Multi-dialect Intent Detection, which focused on cross-dialect intent detection in the banking domain. Our approach involved fine-tuning an encoder-only T5 model, generating synthetic data, and model ensembling. Additionally, we conducted an in-depth analysis of the dataset, addressing annotation errors and problematic translations. Our model was ranked third in the shared task, achieving an F1-score of 0.871.

pdf abs
MA at AraFinNLP2024: BERT-based Ensemble for Cross-dialectal Arabic Intent Detection
Asmaa Ramadan | Manar Amr | Marwan Torki | Nagwa El-Makky

Intent detection, also called intent classification or recognition, is an NLP technique to comprehend the purpose behind user utterances. This paper focuses on Multi-dialect Arabic intent detection in banking, utilizing the ArBanking77 dataset. Our method employs an ensemble of fine-tuned BERT-based models, integrating contrastive loss for training. To enhance generalization to diverse Arabic dialects, we augment the ArBanking77 dataset, originally in Modern Standard Arabic (MSA) and Palestinian, with additional dialects such as Egyptian, Moroccan, and Saudi, among others. Our approach achieved an F1-score of 0.8771, ranking first in subtask-1 of the AraFinNLP shared task 2024.

pdf abs
BFCI at AraFinNLP2024: Support Vector Machines for Arabic Financial Text Classification
Nsrin Ashraf | Hamada Nayel | Mohammed Aldawsari | Hosahalli Shashirekha | Tarek Elshishtawy

In this paper, a description of the system submitted by BFCAI team to the AraFinNLP2024 shared task has been introduced. Our team participated in the first subtask, which aims at detecting the customer intents of cross-dialectal Arabic queries in the banking domain. Our system follows the common pipeline of text classification models using primary classification algorithms integrated with basic vectorization approach for feature extraction. Multi-layer Perceptron, Stochastic Gradient Descent and Support Vector Machines algorithms have been implemented and support vector machines outperformed all other algorithms with an f-score of 49%. Our submission’s result is appropriate compared to the simplicity of the proposed model’s structure.

pdf abs
dzFinNlp at AraFinNLP: Improving Intent Detection in Financial Conversational Agents
Mohamed Lichouri | Khaled Lounnas | Amziane Zakaria

In this paper, we present our dzFinNlp team’s contribution for intent detection in financial conversational agents, as part of the AraFinNLP shared task. We experimented with various models and feature configurations, including traditional machine learning methods like LinearSVC with TF-IDF, as well as deep learning models like Long Short-Term Memory (LSTM). Additionally, we explored the use of transformer-based models for this task. Our experiments show promising results, with our best model achieving a micro F1-score of 93.02% and 67.21% on the ArBanking77 dataset, in the development and test sets, respectively.

We present an overview of the second edition of the ArAIEval shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. In this edition, ArAIEval offers two tasks: (i) detection of propagandistic textual spans with persuasion techniques identification in tweets and news articles, and (ii) distinguishing between propagandistic and non-propagandistic memes. A total of 14 teams participated in the final evaluation phase, with 6 and 9 teams participating in Tasks 1 and 2, respectively. Finally, 11 teams submitted system description papers. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further provide a brief overview of the participating systems. All datasets and evaluation scripts are released to the research community. We hope this will enable further research on these important tasks in Arabic.

pdf abs
MemeMind at ArAIEval Shared Task: Generative Augmentation and Feature Fusion for Multimodal Propaganda Detection in Arabic Memes through Advanced Language and Vision Models
Uzair Shah | Md. Rafiul Biswas | Marco Agus | Mowafa Househ | Wajdi Zaghouani

Detecting propaganda in multimodal content, such as memes, is crucial for combating disinformation on social media. This paper presents a novel approach for the ArAIEval 2024 shared Task 2 on Multimodal Propagandistic Memes Classification, involving text, image, and multimodal classification of Arabic memes. For text classification (Task 2A), we fine-tune state-of-the-art Arabic language models and use ChatGPT4-generated synthetic text for data augmentation. For image classification (Task 2B), we fine-tune ResNet18, EfficientFormerV2, and ConvNeXt-tiny architectures with DALL-E-2-generated synthetic images. For multimodal classification (Task 2C), we combine ConvNeXt-tiny and BERT architectures in a fusion layer to enhance binary classification. Our results show significant performance improvements with data augmentation for text and image classification models and with the fusion layer for multimodal classification. We highlight challenges and opportunities for future research in multimodal propaganda detection in Arabic content, emphasizing the need for robust and adaptable models to combat disinformation.

This paper describes our participation in the ArAIEval Shared Task 2024, focusing on Task 2C, which challenges participants to detect propagandistic elements in multimodal Arabic memes. The challenge involves analyzing both the textual and visual components of memes to identify underlying propagandistic messages. Our approach integrates the capabilities of MARBERT and ResNet50, top-performing pre-trained models for text and image processing, respectively. Our system architecture combines these models through a fusion layer that integrates and processes the extracted features, creating a comprehensive representation that is more effective in detecting nuanced propaganda. Our proposed system achieved significant success, placing second with an F1 score of 0.7987.

pdf abs
Mela at ArAIEval Shared Task: Propagandistic Techniques Detection in Arabic with a Multilingual Approach
Md Riyadh | Sara Nabhani

This paper presents our system submitted for Task 1 of the ArAIEval Shared Task on Unimodal (Text) Propagandistic Technique Detection in Arabic. Task 1 involves identifying all employed propaganda techniques in a given text from a set of possible techniques or detecting that no propaganda technique is present. Additionally, the task requires identifying the specific spans of text where these techniques occur. We explored the capabilities of a multilingual BERT model for this task, focusing on the effectiveness of using outputs from different hidden layers within the model. By fine-tuning the multilingual BERT, we aimed to improve the model’s ability to recognize and locate various propaganda techniques. Our experiments showed that leveraging the hidden layers of the BERT model enhanced detection performance. Our system achieved competitive results, ranking second in the shared task, demonstrating that multilingual BERT models, combined with outputs from hidden layers, can effectively detect and identify spans of propaganda techniques in Arabic text.

pdf abs
MODOS at ArAIEval Shared Task: Multimodal Propagandistic Memes Classification Using Weighted SAM, CLIP and ArabianGPT
Abdelhamid Haouhat | Hadda Cherroun | Slimane Bellaouar | Attia Nehar

Arabic social media platforms are increasingly using propaganda to deceive or influence people. This propaganda is often spread through multimodal content, such as memes. While substantial research has addressed the automatic detection of propaganda in English content, this paper presents the MODOS team’s participation in the Arabic Multimodal Propagandistic Memes Classification shared task. Our system deploys the Segment Anything Model (SAM) and CLIP for image representation and ARABIAN-GPT embeddings for text. Then, we employ LSTM encoders followed by a weighted fusion strategy to perform binary classification. Our system achieved competitive performance in distinguishing between propagandistic and non-propagandistic memes, scored 0.7290 macro F1, and ranked 6th among the participants.

pdf abs
Nullpointer at ArAIEval Shared Task: Arabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging
Abrar Abir | Kemal Oflazer

This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets & news paragraphs, from ArAIEval shared task 1. Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging. Experimental results show that relying on the first token of the word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model’s performance. Our system achieved a score of 25.41, placing us 4th on the leaderboard. Subsequent post-submission improvements further raised our score to 26.68.
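The first-token strategy mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a subword tokenizer that reports, for every token, the index of the word it came from (with `None` for special tokens), and the function name, the sample `word_ids`, and the technique labels are all hypothetical.

```python
from typing import List, Optional


def word_labels_from_first_subtoken(
    word_ids: List[Optional[int]], token_labels: List[str]
) -> List[str]:
    """Keep, for each word, the label predicted for its first subword token."""
    labels: List[str] = []
    previous: Optional[int] = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:
            # Special tokens (e.g. sequence start/end markers) belong to no word.
            previous = None
            continue
        if word_id != previous:
            # First subword of a new word: its prediction labels the whole word.
            labels.append(label)
        previous = word_id
    return labels


# Hypothetical example: three words split into 2, 1, and 3 subword tokens.
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
token_labels = ["O", "B-Loaded_Language", "O", "O", "B-Doubt", "I-Doubt", "O", "O"]
print(word_labels_from_first_subtoken(word_ids, token_labels))
```

Predictions on later subword tokens of a word are simply ignored here, which is one common way to turn token-level outputs into word-level tags.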

pdf abs
MemeMind at ArAIEval Shared Task: Spotting Persuasive Spans in Arabic Text with Persuasion Techniques Identification
Md. Rafiul Biswas | Zubair Shah | Wajdi Zaghouani

This paper focuses on detecting propagandistic spans and persuasion techniques in Arabic text from tweets and news paragraphs. Each entry in the dataset contains a text sample and corresponding labels that indicate the start and end positions of propaganda techniques within the text. Tokens falling within a labeled span were assigned ’B’ (Begin) or ’I’ (Inside) tags corresponding to the specific propaganda technique, and all other tokens were tagged ’O’. Using attention masks, we created uniform lengths for each span and assigned BIO tags to each token based on the provided labels. Then, we used AraBERT-base pre-trained model for Arabic text tokenization and embeddings with a token classification layer to identify propaganda techniques. Our training process involves a two-phase fine-tuning approach. First, we train only the classification layer for a few epochs, followed by full model fine-tuning, updating all parameters. This methodology allows the model to adapt to the specific characteristics of the propaganda detection task while leveraging the knowledge captured by the pretrained AraBERT model. Our approach achieved an F1 score of 0.2774, securing the 3rd position in the leaderboard of Task 1.
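The span-to-BIO conversion described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses whitespace tokenization instead of the AraBERT tokenizer, takes spans as hypothetical `(start, end, technique)` character offsets with an exclusive end, and lets the first matching span win if spans overlap. The sample sentence is in English only for readability, and the label name is illustrative.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char_exclusive, technique)


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each whitespace token with B-/I-<technique> or O."""
    tagged: List[Tuple[str, str]] = []
    cursor = 0
    previous: Optional[int] = None  # index of the span the previous token was in
    for token in text.split():
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        # Index of the first span that overlaps this token, if any.
        current = next(
            (i for i, (s, e, _) in enumerate(spans) if s < end and start < e),
            None,
        )
        if current is None:
            tag = "O"
        elif current == previous:
            tag = "I-" + spans[current][2]
        else:
            tag = "B-" + spans[current][2]
        tagged.append((token, tag))
        previous = current
    return tagged


text = "they always lie to the public"
spans = [(5, 15, "Loaded_Language")]  # covers "always lie"
for token, tag in spans_to_bio(text, spans):
    print(token, tag)
```

With a subword tokenizer, the same overlap test would be applied to each token's character offsets rather than to whitespace-delimited words.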

pdf abs
CLTL at ArAIEval Shared Task: Multimodal Propagandistic Memes Classification Using Transformer Models
Yeshan Wang | Ilia Markov

We present the CLTL system designed for the ArAIEval Shared Task 2024 on multimodal propagandistic memes classification in Arabic. The challenge was divided into three subtasks: identifying propagandistic content from textual modality of memes (subtask 2A), from visual modality of memes (subtask 2B), and in a multimodal scenario when both modalities are combined (subtask 2C). We explored various unimodal transformer models for Arabic language processing (subtask 2A), visual models for image processing (subtask 2B), and concatenated text and image embeddings using the Multilayer Perceptron fusion module for multimodal propagandistic memes classification (subtask 2C). Our system achieved 77.96% for subtask 2A, 71.04% for subtask 2B, and 79.80% for subtask 2C, ranking 2nd, 1st, and 3rd on the leaderboard.

pdf abs
CUET_sstm at ArAIEval Shared Task: Unimodal (Text) Propagandistic Technique Detection Using Transformer-Based Model
Momtazul Labib | Samia Rahman | Hasan Murad | Udoy Das

In recent days, propaganda has started to influence public opinion increasingly as social media usage continues to grow. Our research has been part of the first challenge, Unimodal (Text) Propagandistic Technique Detection of ArAIEval shared task at the ArabicNLP 2024 conference, co-located with ACL 2024, identifying specific Arabic text spans using twenty-three propaganda techniques. We have augmented underrepresented techniques in the provided dataset using synonym replacement and have evaluated various machine learning (RF, SVM, MNB), deep learning (BiLSTM), and transformer-based models (bert-base-arabic, Marefa-NER, AraBERT) with transfer learning. Our comparative study has shown that the transformer model “bert-base-arabic” has outperformed other models. Evaluating the test set, it has achieved the micro-F1 score of 0.2995 which is the highest. This result has secured our team “CUET_sstm” first place among all participants in task 1 of the ArAIEval.

pdf abs
AlexUNLP-MZ at ArAIEval Shared Task: Contrastive Learning, LLM Features Extraction and Multi-Objective Optimization for Arabic Multi-Modal Meme Propaganda Detection
Mohamed Zaytoon | Nagwa El-Makky | Marwan Torki

The rise of memes as a tool for spreading propaganda presents a significant challenge in the current digital environment. In this paper, we outline our work for the ArAIEval Shared Task 2 in ArabicNLP 2024. This study introduces a method for identifying propaganda in Arabic memes using a multimodal system that combines textual and visual indicators to enhance the result. Our approach achieves the first place in text classification with Macro-F1 of 78.69%, the third place in image classification with Macro-F1 of 65.92%, and the first place in multimodal classification with Macro-F1 of 80.51%.

Detecting propagandistic spans and identifying persuasion techniques are crucial for promoting informed decision-making, safeguarding democratic processes, and fostering a media environment characterized by integrity and transparency. Various machine learning (Logistic Regression, Random Forest, and Multinomial Naive Bayes), deep learning (CNN, CNN+LSTM, CNN+BiLSTM), and transformer-based (AraBERTv2, AraBERT-NER, CamelBERT, BERT-Base-Arabic) models were exploited to perform the task. The evaluation results indicate that CamelBERT achieved the highest micro-F1 score (24.09%), outperforming CNN+LSTM and AraBERTv2. The study found that most models struggle to detect propagandistic spans when multiple spans are present within the same article. Overall, the model’s performance secured a 6th place ranking in the ArAIEval Shared Task-1.

pdf abs
SussexAI at ArAIEval Shared Task: Mitigating Class Imbalance in Arabic Propaganda Detection
Mary Fouad | Julie Weeds

In this paper, we are exploring mitigating class imbalance in Arabic propaganda detection. Given a multigenre text which could be a news paragraph or a tweet, the objective is to identify the propaganda technique employed in the text along with the exact span(s) where each technique occurs. We approach this task as a sequence tagging task. We utilise AraBERT for sequence classification and implement data augmentation and random truncation methods to mitigate the class imbalance within the dataset. We demonstrate the importance of considering macro-F1 as well as micro-F1 when evaluating classifier performance in this scenario.
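The following sketch shows why macro-F1 and micro-F1 can diverge under class imbalance. It is a simplified illustration, not the shared task's scorer: it assumes one gold label and one predicted label per item rather than span-level matching, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def micro_macro_f1(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average with equal class weight.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# A classifier that always predicts the majority class "A".
gold = ["A"] * 9 + ["B"]
pred = ["A"] * 10
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the majority-class predictor reaches a micro-F1 of 0.9 while its macro-F1 is below 0.5, because the rare class contributes an F1 of zero to the unweighted average.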

We present an overview of the FIGNEWS shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. The shared task addresses bias and propaganda annotation in multilingual news posts. We focus on the early days of the Israel War on Gaza as a case study. The task aims to foster collaboration in developing annotation guidelines for subjective tasks by creating frameworks for analyzing diverse narratives highlighting potential bias and propaganda. In a spirit of fostering and encouraging diversity, we address the problem from a multilingual perspective, namely within five languages: English, French, Arabic, Hebrew, and Hindi. A total of 17 teams participated in two annotation subtasks: bias (16 teams) and propaganda (6 teams). The teams competed in four evaluation tracks: guidelines development, annotation quality, annotation quantity, and consistency. Collectively, the teams produced 129,800 data points. Key findings and implications for the field are discussed.

This paper presents our team’s contribution to the FIGNEWS 2024 Shared Task, which involved annotating bias and propaganda in news coverage of the Israel-Palestine conflict. We developed comprehensive guidelines and employed a rigorous methodology to analyze 2,200 news posts from several official Facebook accounts of news websites in multiple languages. Our team, Narrative Navigators, achieved third place in both the Bias Guidelines and Bias Consistency tracks, demonstrating the effectiveness of our approach. We achieved an IAA Kappa score of 39.4 for bias annotation and 12.8 for propaganda detection. These findings and our performance underscore the need for enhanced media literacy and further research to counter the impact of biased and misleading information on public understanding of the conflict.
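For readers unfamiliar with kappa-based inter-annotator agreement (IAA), here is a minimal sketch of Cohen's kappa for two annotators. It is an assumption that this two-annotator variant is comparable to the score reported above; the shared task may use a different kappa formulation, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from typing import List


def cohen_kappa(annotator_a: List[str], annotator_b: List[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(annotator_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label distribution.
    counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


a = ["biased", "biased", "unbiased", "unbiased"]
b = ["biased", "unbiased", "unbiased", "unbiased"]
print(cohen_kappa(a, b))
```

In this toy example the annotators agree on 3 of 4 items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5, which on a percentage scale would be reported as 50.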

pdf abs
DRAGON at FIGNEWS 2024 Shared Task: a Dedicated RAG for October 7th conflict News
Sadegh Jafari | Mohsen Mahmoodzadeh | Vanooshe Nazari | Razieh Bahmanyar | Kathryn Burrows

In this study, we present a novel approach to annotating bias and propaganda in social media data by leveraging topic modeling techniques. Utilizing the BERTopic tool, we performed topic modeling on the FIGNEWS Shared-task dataset, which initially comprised 13,500 samples. From this dataset, we identified 35 distinct topics and selected approximately 50 representative samples from each topic, resulting in a subset of 1,812 samples. These selected samples were meticulously annotated for bias and propaganda labels. Subsequently, we employed multiple methods like KNN, SVC, XGBoost, and RAG to develop a classifier capable of detecting bias and propaganda within social media content. Our approach demonstrates the efficacy of using topic modeling for efficient data subset selection and provides a robust foundation for improving the accuracy of bias and propaganda detection in large-scale social media datasets.

pdf abs
LexiconLadies at FIGNEWS 2024 Shared Task: Identifying Keywords for Bias Annotation Guidelines of Facebook News Headlines on the Israel-Palestine 2023 War
Yousra El-Ghawi | Abeer Marzouk | Aya Khamis

News bias is difficult for humans to identify, but even more so for machines. This is largely due to the lack of linguistically appropriate annotated datasets suitable for use by classifier algorithms. The FIGNEWS Subtask 1: Bias Annotation involved classifying bias through manually annotated 1800 headlines from social media. Our proposed guidelines investigated which combinations of keywords available for classification, across sentence and token levels, may be used to detect possible bias in a conflict where neutrality is highly undesirable. Much of the headlines’ percentage required contextual knowledge of events to identify criteria that matched biased or targeted language. The final annotation guidelines paved the way for a theoretical system which uses keyword and hashtag significance to classify major instances of bias. Minor instances with bias undertones or clickbait may require advanced machine learning methods which learn context through scraping user engagements on social media.

pdf abs
Uot1 at FIGNEWS 2024 Shared Task: Labeling News Bias
Abdusalam Nwesri | Mai Elbaabaa | Fatima Lashihar | Fatma Alalos

This paper outlines the University of Tripoli’s initiative in creating annotation guidelines to detect bias in news articles concerning the Palestinian-Israeli conflict. Our team participated in the Framing of Israeli Gaza News Media Narrative (FIGNEWS 2024) shared task. We developed annotation guidelines to label bias in news articles. Using those guidelines we managed to annotate 3,900 articles with the aid of our custom-developed annotation tool. Among 16 participating teams, we scored 48.7 on the macro F1 measure in the quality track, in which we ranked 4th. In the centrality track we ranked 6th on the macro F1 average measure; however, we achieved the 4th best kappa coefficient. Our bias annotation guidelines were ranked 9th.

In this paper, we present our methodology and findings from participating in the FIGNEWS 2024 shared task on annotating news fragments on the Gaza-Israel war for bias and propaganda detection. The task aimed to refine the FIGNEWS 2024 annotation guidelines and to contribute to the creation of a comprehensive dataset to advance research in this field. Our team employed a multi-faceted approach to ensure high accuracy in data annotations. Our results highlight key challenges in detecting bias and propaganda, such as the need for more comprehensive guidelines. Our team ranked first in all tracks for propaganda annotation. For Bias, the team stood in first place for the Guidelines and IAA tracks, and in second place for the Quantity and Consistency tracks.

pdf abs
Bias Bluff Busters at FIGNEWS 2024 Shared Task: Developing Guidelines to Make Bias Conscious
Jasmin Heierli | Silvia Pareti | Serena Pareti | Tatiana Lando

This paper details our participation in the FIGNEWS-2024 shared task on bias and propaganda annotation in Gaza conflict news. Our objectives were to develop robust guidelines and annotate a substantial dataset to enhance bias detection. We iteratively refined our guidelines and used examples for clarity. Key findings include the challenges in achieving high inter-annotator agreement and the importance of annotator awareness of their own biases. We also explored the integration of ChatGPT as an annotator to support consistency. This paper contributes to the field by providing detailed annotation guidelines, and offering insights into the subjectivity of bias annotation.

pdf abs
Ceasefire at FIGNEWS 2024 Shared Task: Automated Detection and Annotation of Media Bias Using Large Language Models
Noor Sadiah | Sara Al-Emadi | Sumaya Rahman

In this paper, we present our approach for FIGNEWS Subtask 1, which focuses on detecting bias in news media narratives about the Israel war on Gaza. We used a Large Language Model (LLM) and prompt engineering, using GPT-3.5 Turbo API, to create a model that automatically flags biased news media content with 99% accuracy. This approach provides Natural Language Processing (NLP) researchers with a robust and effective solution for automating bias detection in news media narratives using supervised learning algorithms. Additionally, this paper provides a detailed analysis of the labeled content, offering valuable insights into media bias in conflict reporting. Our work advances automated content analysis and enhances understanding of media bias.

pdf abs
Sahara Pioneers at FIGNEWS 2024 Shared Task: Data Annotation Guidelines for Propaganda Detection in News Items
Marwa Solla | Hassan Ebrahem | Alya Issa | Harmain Harmain | Abdusalam Nwesri

In today’s digital age, the spread of propaganda through news channels has become a pressing concern. To address this issue, the research community has organized a shared task on detecting propaganda in news posts. This paper aims to present the work carried out at the University of Tripoli for the development and implementation of data annotation guidelines by a team of five annotators. The guidelines were used to annotate 2600 news articles. Each article is labeled as “propaganda”, “Not propaganda”, “Not Applicable”, or “Not clear”. The shared task results put our efforts in the third position among 6 participating teams in the consistency track.

pdf abs
BiasGanda at FIGNEWS 2024 Shared Task: A Quest to Uncover Biased Views in News Coverage
Blqees Blqees | Al Wardi | Malath Al-Sibani | Hiba Al-Siyabi | Najma Zidjaly

In this study, we aimed to identify biased language in a dataset provided by the FIGNEWS 2024 committee on the Gaza-Israel war. We classified entries into seven categories: Unbiased, Biased against Palestine, Biased against Israel, Biased against Others, Biased against both Palestine and Israel, Unclear, and Not Applicable. Our team reviewed the literature to develop a codebook of terminologies and definitions. By coding each example, we sought to detect language tendencies used by media outlets when reporting on the same event. The primary finding was that most examples were classified as “Biased against Palestine,” as all examined language data used one-sided terms to describe the October 7 event. The least used category was “Not Applicable,” reserved for irrelevant examples or those lacking context. It is recommended to use neutral and balanced language when reporting volatile political news.

pdf abs
The CyberEquity Lab at FIGNEWS 2024 Shared Task: Annotating a Corpus of Facebook Posts to Label Bias and Propaganda in Gaza-Israel War Coverage in Five Languages
Mohammed Helal | Radi Jarrar | Mohammed Alkhanafseh | Abdallah Karakra | Ruba Awadallah

This paper presents The_CyberEquity_Lab team’s participation in the FIGNEWS 2024 Shared Task (Zaghouani et al., 2024). The task is to annotate a corpus of Facebook posts for bias and propaganda in coverage of the Gaza-Israel war. The posts represent news articles written in five different languages. The paper presents the annotation guidelines that the team adhered to in identifying both bias and propaganda in coverage of this continuous conflict.

pdf abs
BSC-LANGTECH at FIGNEWS 2024 Shared Task: Exploring Semi-Automatic Bias Annotation using Frame Analysis
Valle Ruiz-Fernández | José Saiz | Aitor Gonzalez-Agirre

This paper introduces the methodology of the BSC-LANGTECH team for the FIGNEWS 2024 Shared Task on News Media Narratives. Following the bias annotation subtask, we apply the theory and methods of framing analysis to develop guidelines to annotate bias in the corpus provided by the task organizers. The manual annotation of a subset, on which moderate inter-annotator agreement (IAA) has been achieved, is further used in Deep Learning techniques to explore automatic annotation and test the reliability of our framework.

In this paper we report the development of our annotation methodology for the shared task FIGNEWS 2024. The objective of the shared task is to look into the layers of bias in how the war on Gaza is represented in media narrative. Our methodology follows the prescriptive paradigm, in which guidelines are detailed and refined through an iterative process in which edge cases are discussed and converged. Our IAA score (Krippendorff’s 𝛼) is 0.420, highlighting the challenging and subjective nature of the task. Our results show that 52% of posts were unbiased, 42% biased against Palestine, 5% biased against Israel, and 3% biased against both. 16% were unclear or not applicable.

pdf abs
Sina at FigNews 2024: Multilingual Datasets Annotated with Bias and Propaganda.
Lina Duaibes | Areej Jaber | Mustafa Jarrar | Ahmad Qadi | Mais Qandeel

The proliferation of bias and propaganda on social media is an increasingly significant concern, leading to the development of techniques for automatic detection. This article presents a multilingual corpus of 12,000 Facebook posts fully annotated for bias and propaganda. The corpus was created as part of the FigNews 2024 Shared Task on News Media Narratives for framing the Israeli War on Gaza. It covers various events during the War from October 7, 2023 to January 31, 2024. The corpus comprises 12,000 posts in five languages (Arabic, Hebrew, English, French, and Hindi), with 2,400 posts for each language. The annotation process involved 10 graduate students specializing in Law. The Inter-Annotator Agreement (IAA) was used to evaluate the annotations of the corpus, with an average IAA of 80.8% for bias and 70.15% for propaganda annotations. Our team was ranked among the best-performing teams in both Bias and Propaganda subtasks. The corpus is open-source and available at https://sina.birzeit.edu/fada

pdf abs
SQUad at FIGNEWS 2024 Shared Task: Unmasking Bias in Social Media Through Data Analysis and Annotation
Asmahan Al-Mamari | Fatma Al-Farsi | Najma Zidjaly

This paper is a part of the FIGNEWS 2024 Datathon Shared Task and it aims to investigate bias and double standards in media coverage of the Gaza-Israel 2023-2024 conflict through a comprehensive analysis of news articles. The methodology integrated both manual labeling as well as the application of a natural language processing (NLP) tool, which is the Facebook/BART-large-MNLI model. The annotation process involved categorizing the dataset based on identified biases, following a set of guidelines in which categories of bias were defined by the team. The findings revealed that most of the media texts provided for analysis included bias against Palestine, whether it was through the use of biased vocabulary or even tone. It was also found that texts written in Hebrew contained the most bias against Palestine. In addition, when comparing annotations done by AAI-1 and AAI-2, the results turned out to be very similar, which might be mainly due to the clear annotation guidelines set by the annotators themselves. Thus, we recommend the use of clear guidelines to facilitate the process of annotation by future researchers.

pdf abs
JusticeLeague at FIGNEWS 2024 Shared Task: Innovations in Bias Annotation
Amr Saleh | Huda Mohamed | Hager Sayed

In response to the evolving media representation of the Gaza-Israel conflict, this study aims to categorize news articles based on their bias towards specific entities. Our primary objective is to annotate news articles with labels that indicate their bias: “Unbiased”, “Biased against Palestine”, “Biased against Israel”, “Biased against both Palestine and Israel”, “Biased against others”, “Unclear”, or “Not Applicable”. The methodology involves a detailed annotation process where each article is carefully reviewed and labeled according to predefined guidelines. For instance, an article reporting factual events without derogatory language is labeled as “Unbiased”, while one using inflammatory language against Palestinians is marked as “Biased against Palestine”. Key findings include the identification of various degrees of bias in news articles, highlighting the importance of critical analysis in media consumption. This research contributes to the broader effort of understanding media bias and promoting unbiased journalism. Tools such as Google Drive and Google Sheets facilitated the annotation process, enabling efficient collaboration and data management among the annotators. Our work also includes comprehensive guidelines and examples to ensure consistent annotation, enhancing the reliability of the data.

pdf abs
Eagles at FIGNEWS 2024 Shared Task: A Context-informed Prescriptive Approach to Bias Detection in Contentious News Narratives
Amanda Chan | Mai A.Baddar | Sofien Baazaoui

This research paper presents an in-depth examination of bias identification in media content related to the Israel-Palestine war. Focusing on the annotation guidelines and process developed by our team of researchers, the document outlines a systematic approach to discerning bias in articles. Through meticulous analysis, key indicators of bias such as emotive language, weasel words, and loaded comparisons are identified and discussed. The paper also explores the delineation between facts and opinions, emphasizing the importance of maintaining objectivity in annotation. Ethical considerations, including the handling of sensitive data and the promotion of multipartiality among annotators, are carefully addressed. The annotation guidelines also include other ethical considerations such as identifying rumors, false information, exercising prudence and selective quotations. The research paper offers insights into the annotation experience, highlighting common mistakes and providing valuable guidelines for future research in bias identification. By providing a comprehensive framework for evaluating bias in media coverage of the Israel-Palestine war, this study contributes to a deeper understanding of the complexities inherent in media discourse surrounding contentious geopolitical issues.

pdf abs
The Guidelines Specialists at FIGNEWS 2024 Shared Task: An annotation guideline to Unravel Bias in News Media Narratives Using a Linguistic Approach
Ghizlane Bourahouat | Samar Amer

This article presents the participation of “The Guideline Specialists” in the FIGNEWS 2024 Shared Task, which aims to unravel bias and propaganda in news media narratives surrounding the Gaza-Israel 2023-2024 war. Leveraging innovative annotation methodologies and drawing on a diverse team of annotators, our approach focuses on meticulously annotating news articles using a linguistic approach to uncover the intricate nuances of bias. By incorporating detailed examples and drawing on related work that shows how language structure, represented in the use of passive voice or nominalization, and the choice of vocabulary carry bias, our findings provide valuable insights into the representation of the Gaza-Israel conflict across various languages and cultures. The guideline we developed detected bias against Gaza, against Israel, and against others by setting keywords that are based on a linguistic background and tested with the AntConc concordance tool. The result was an annotation guideline that has a solid base. Through this collaborative effort, we developed a guideline that contributes to fostering a deeper understanding of media narratives during one of the most critical moments in recent history.

This paper outlines the KSAA-CAD shared task, highlighting the Contemporary Arabic Language Dictionary within the scenario of developing a Reverse Dictionary (RD) system and enhancing Word Sense Disambiguation (WSD) capabilities. The first KSAA-RD (Al-Matham et al., 2023) highlighted significant gaps in the domain of RDs, which are designed to retrieve words by their meanings or definitions. This shared task comprises two tasks: RD and WSD. The RD task focuses on identifying word embeddings that most accurately match a given definition, termed a “gloss,” in Arabic. Conversely, the WSD task involves determining the specific meaning of a word in context, particularly when the word has multiple meanings. The winning team achieved the highest-ranking score of 0.0644 in RD using Electra embeddings. In this paper, we describe the methods employed by the participating teams and provide insights into the future direction of KSAA-CAD.

pdf abs
Cher at KSAA-CAD 2024: Compressing Words and Definitions into the Same Space for Arabic Reverse Dictionary
Pinzhen Chen | Zheng Zhao | Shun Shao

We present Team Cher’s submission to the ArabicNLP 2024 KSAA-CAD shared task on the reverse dictionary for Arabic—the retrieval of words using definitions as a query. Our approach is based on a multi-task learning framework that jointly learns reverse dictionary, definition generation, and reconstruction tasks. This work explores different tokenization strategies and compares retrieval performance for each embedding architecture. Evaluation using the KSAA-CAD benchmark demonstrates the effectiveness of our multi-task approach and provides insights into the reverse dictionary task for Arabic. It is worth highlighting that we achieve strong performance without using any external resources in addition to the provided training data.

pdf abs
MISSION at KSAA-CAD 2024: AraT5 with Arabic Reverse Dictionary
Thamer Alharbi

This research paper presents our approach for the KSAA-CAD 2024 competition, focusing on the Arabic Reverse Dictionary (RD) task (Alshammari et al., 2024). Leveraging the functionalities of the Arabic Reverse Dictionary, our system allows users to input glosses and retrieve corresponding words. We provide all associated notebooks and developed models on GitHub and Hugging Face, respectively. Our task entails working with a dataset comprising dictionary data and word embedding vectors, utilizing three different architectures of contextualized word embeddings: AraELECTRA, AraBERTv2, and camelBERT-MSA. We fine-tune the AraT5v2-base-1024 model for predicting each embedding, considering various hyperparameters for training and validation. Evaluation metrics include ranking accuracy, mean squared error (MSE), and cosine similarity. The results demonstrate the effectiveness of our approach on both development and test datasets, showcasing promising performance across different embedding types.
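
Two of the evaluation metrics named above, mean squared error and cosine similarity, compare a predicted embedding against the target word embedding. A minimal sketch using their standard definitions is given below. The function names and the toy three-dimensional vectors are illustrative assumptions, and the shared task's ranking metric is not reproduced here.

```python
import math


def mean_squared_error(pred, target):
    """Average squared difference between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(pred, target):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0 or norm_t == 0:
        return 0.0
    return dot / (norm_p * norm_t)


# Toy vectors standing in for a predicted and a gold word embedding.
predicted = [0.5, 0.0, 0.5]
gold = [1.0, 0.0, 0.0]

mse = mean_squared_error(predicted, gold)
cos = cosine_similarity(predicted, gold)
```

Lower MSE and higher cosine similarity both indicate a prediction closer to the target, but they can disagree: cosine similarity ignores vector length, while MSE does not.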

Semantic search tasks have grown extremely fast following the advancements in large language models, including the Reverse Dictionary and Word Sense Disambiguation in Arabic. This paper describes our participation in the Contemporary Arabic Dictionary Shared Task. We propose two models that achieved first place in both tasks. We conducted comprehensive experiments on the latest five multilingual sentence transformers and the Arabic BERT model for semantic embedding extraction. We achieved a ranking score of 0.06 for the reverse dictionary task, which is double that of last year’s winner. We had an accuracy score of 0.268 for the Word Sense Disambiguation task.

pdf abs
Baleegh at KSAA-CAD 2024: Towards Enhancing Arabic Reverse Dictionaries
Mais Alheraki | Souham Meshoul

The domain of reverse dictionaries (RDs), while advancing in languages like English and Chinese, remains underdeveloped for Arabic. This study attempts to explore a data-driven approach to enhance word retrieval processes in Arabic RDs. The research focuses on the ArabicNLP 2024 Shared Task, named KSAA-CAD, which provides a dictionary dataset of 39,214 word-gloss pairs, each with a corresponding target word embedding. The proposed solution aims to surpass the baseline performance by employing SOTA deep learning models and innovative data expansion techniques. The methodology involves enriching the dataset with contextually relevant examples, training a T5 model to align the words to their glosses in the space, and evaluating the results on the shared task metrics. We find that our model is closely aligned with the baseline performance on the bertseg and bertmsa targets; however, it does not perform well on the electra target, suggesting the need for further exploration.

We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI’s objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on prespecified tasks. NADI 2024 targeted dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of whom 12 teams participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE for Subtask 2, and 20.44 BLEU in Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

pdf abs
Arabic Train at NADI 2024 shared task: LLMs’ Ability to Translate Arabic Dialects into Modern Standard Arabic
Anastasiia Demidova | Hanin Atwany | Nour Rabih | Sanad Sha’ban

Navigating the intricacies of machine translation (MT) involves tackling the nuanced disparities between Arabic dialects and Modern Standard Arabic (MSA), presenting a formidable obstacle. In this study, we delve into Subtask 3 of the NADI shared task (CITATION), focusing on the translation of sentences from four distinct Arabic dialects into MSA. Our investigation explores the efficacy of various models, including Jais, NLLB, GPT-3.5, and GPT-4, in this dialect-to-MSA translation endeavor. Our findings reveal that Jais surpasses all other models, boasting an average BLEU score of 19.48 in the combination of zero- and few-shot setting, whereas NLLB exhibits the least favorable performance, garnering a BLEU score of 8.77.

pdf abs
AlexUNLP-STM at NADI 2024 shared task: Quantifying the Arabic Dialect Spectrum with Contrastive Learning, Weighted Sampling, and BERT-based Regression Ensemble
Abdelrahman Sakr | Marwan Torki | Nagwa El-Makky

Recognizing the nuanced spectrum of dialectness in Arabic text poses a significant challenge for natural language processing (NLP) tasks. Traditional dialect identification (DI) methods treat the task as binary, overlooking the continuum of dialect variation present in Arabic speech and text. In this paper, we describe our submission to the NADI shared Task of ArabicNLP 2024. We participated in Subtask 2 - ALDi Estimation, which focuses on estimating the Arabic Level of Dialectness (ALDi) for Arabic text, indicating how much it deviates from Modern Standard Arabic (MSA) on a scale from 0 to 1, where 0 means MSA and 1 means high divergence from MSA. We explore diverse training approaches, including contrastive learning, applying a random weighted sampler along with fine-tuning a regression task based on the AraBERT model, after adding a linear and non-linear layer on top of its pooled output. Finally, performing a brute force ensemble strategy increases the performance of our system. Our proposed solution achieved a Root Mean Squared Error (RMSE) of 0.1406, ranking second on the leaderboard.
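
The Root Mean Squared Error reported above measures how far predicted dialectness scores fall from the gold scores on the 0 to 1 ALDi scale. The sketch below shows the standard RMSE computation on made-up values. The function name and the toy scores are assumptions for illustration and are unrelated to the submitted system.

```python
import math


def rmse(predictions, references):
    """Root mean squared error between two equal-length score lists."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predictions, references)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy ALDi-style scores: 0 means MSA, 1 means high divergence from MSA.
gold_scores = [0.0, 0.5, 1.0]
predicted_scores = [0.1, 0.5, 0.8]

error = rmse(predicted_scores, gold_scores)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is worth keeping in mind when comparing scores as close as 0.1403 and 0.1406.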

pdf abs
NLP_DI at NADI 2024 shared task: Multi-label Arabic Dialect Classifications with an Unsupervised Cross-Encoder
Vani Kanjirangat | Tanja Samardzic | Ljiljana Dolamic | Fabio Rinaldi

We report the approaches submitted to the NADI 2024 Subtask 1: Multi-label country-level Dialect Identification (MLDID). The core part was to adapt the information from multi-class data for a multi-label dialect classification task. We experimented with supervised and unsupervised strategies to tackle the task in this challenging setting. Under the supervised setup, we used the model trained using NADI 2023 data and devised approaches to convert the multi-class predictions to multi-label by using information from the confusion matrix or using calibrated probabilities. Under unsupervised settings, we used the Arabic-based sentence encoders and multilingual cross-encoders to retrieve similar samples from the training set, considering each test input as a query. The associated labels are then assigned to the input query. We also tried different variations, such as co-occurring dialects derived from the provided development set. We obtained the best validation performance of 48.5% F-score using one of the variations with an unsupervised approach and the same approach yielded the best test result of 43.27% (Ranked 2).
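
One simple way to turn multi-class probabilities into a multi-label prediction, in the spirit of the calibrated-probability conversion mentioned above, is to keep every label whose probability clears a threshold. The sketch below is a hypothetical illustration only: the threshold value, the fallback rule, and the function name `to_multilabel` are assumptions, and the abstract does not specify the authors' actual conversion rule.

```python
def to_multilabel(probabilities, threshold=0.3):
    """Map a {label: probability} dict to a sorted list of predicted labels.

    Every label at or above `threshold` is kept. If none qualifies, the
    single most probable label is returned so the output is never empty.
    Both the threshold and the fallback are illustrative assumptions.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    selected = [label for label, p in probabilities.items() if p >= threshold]
    if not selected:
        selected = [max(probabilities, key=probabilities.get)]
    return sorted(selected)


# Toy probabilities for one sentence over three country-level dialects.
ambiguous = to_multilabel({"Egypt": 0.45, "Sudan": 0.35, "Iraq": 0.20})
confident = to_multilabel({"Egypt": 0.05, "Sudan": 0.05, "Iraq": 0.90})
flat = to_multilabel({"Egypt": 0.25, "Sudan": 0.26, "Iraq": 0.24, "Oman": 0.25})
```

In this toy run the first sentence receives two labels, the second receives one, and the third falls back to its single most probable label because no probability reaches the threshold.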

pdf abs
ASOS at NADI 2024 shared task: Bridging Dialectness Estimation and MSA Machine Translation for Arabic Language Enhancement
Omer Nacar | Serry Sibaee | Abdullah Alharbi | Lahouari Ghouti | Anis Koubaa

This study undertakes a comprehensive investigation of transformer-based models to advance Arabic language processing, focusing on two pivotal aspects: the estimation of Arabic Level of Dialectness and dialectal sentence-level machine translation into Modern Standard Arabic. We conducted various evaluations of different sentence transformers across a proposed regression model, showing that the MARBERT transformer-based proposed regression model achieved the best root mean square error of 0.1403 for Arabic Level of Dialectness estimation. In parallel, we developed bi-directional translation models between Modern Standard Arabic and four specific Arabic dialects—Egyptian, Emirati, Jordanian, and Palestinian—by fine-tuning and evaluating different sequence-to-sequence transformers. This approach significantly improved translation quality, achieving a BLEU score of 0.1713. We also enhanced our evaluation capabilities by integrating MSA predictions from the machine translation model into our Arabic Level of Dialectness estimation framework, forming a comprehensive pipeline that not only demonstrates the effectiveness of our methodologies but also establishes a new benchmark in the deployment of advanced Arabic NLP technologies.

pdf abs
dzNLP at NADI 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and TF-IDF Features
Mohamed Lichouri | Khaled Lounnas | Zahaf Nadjib | Rabiai Ayoub

This paper presents the contribution of our dzNLP team to the NADI 2024 shared task, specifically in Subtask 1 - Multi-label Country-level Dialect Identification (MLDID) (Closed Track). We explored various configurations to address the challenge: in Experiment 1, we utilized a union of n-gram analyzers (word, character, character with word boundaries) with different n-gram values; in Experiment 2, we combined a weighted union of Term Frequency-Inverse Document Frequency (TF-IDF) features with various weights; and in Experiment 3, we implemented a weighted major voting scheme using three classifiers: Linear Support Vector Classifier (LSVC), Random Forest (RF), and K-Nearest Neighbors (KNN). Our approach, despite its simplicity and reliance on traditional machine learning techniques, demonstrated competitive performance in terms of accuracy and precision. Notably, we achieved the highest precision score of 63.22% among the participating teams. However, our overall F1 score was approximately 21%, significantly impacted by a low recall rate of 12.87%. This indicates that while our models were highly precise, they struggled to recall a broad range of dialect labels, highlighting a critical area for improvement in handling diverse dialectal variations.
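
The relationship between the precision, recall, and F1 figures quoted above can be checked with the standard harmonic-mean formula. The sketch below plugs in the reported precision (63.22) and recall (12.87). It assumes F1 is computed directly from these two aggregate numbers; a macro-averaged F1 computed per label could differ somewhat.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported_precision = 63.22
reported_recall = 12.87

f1 = f1_score(reported_precision, reported_recall)
```

The result is roughly 21.4, in line with the "approximately 21%" stated in the abstract. The harmonic mean is pulled toward the smaller of its two inputs, which is why high precision cannot compensate for low recall.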

pdf abs
ELYADATA at NADI 2024 shared task: Arabic Dialect Identification with Similarity-Induced Mono-to-Multi Label Transformation.
Amira Karoui | Farah Gharbi | Rami Kammoun | Imen Laouirine | Fethi Bougares

This paper describes our submissions to the Multi-label Country-level Dialect Identification subtask of the NADI2024 shared task, organized during the second edition of the ArabicNLP conference. Our submission is based on the ensemble of fine-tuned BERT-based models, after implementing the Similarity-Induced Mono-to-Multi Label Transformation (SIMMT) on the input data. Our submission ranked first with a Macro-Average (MA) F1 score of 50.57%.

pdf abs
Alson at NADI 2024 shared task: Alson - A fine-tuned model for Arabic Dialect Translation
Manan AlMusallam | Samar Ahmad

DA-MSA Machine Translation is a recent challenge due to the multitude of Arabic dialects and their variations. In this paper, we present our results within the context of Subtask 3 of the NADI-2024 Shared Task (Abdul-Mageed et al., 2024), that is, DA-MSA Machine Translation. We utilized the DIALECTS-MSA MADAR corpus (Bouamor et al., 2018), the Emi-NADI corpus for the Emirati dialect (Khered et al., 2023), and we augmented the Palestinian and Jordanian datasets based on NADI 2021. Our approach involves developing sentence-level machine translations from Palestinian, Jordanian, Emirati, and Egyptian dialects to Modern Standard Arabic (MSA). To address this challenge, we fine-tuned models such as (Nagoudi et al., 2022) AraT5v2-msa-small, AraT5v2-msa-base, and (Elmadany et al., 2023) AraT5v2-base-1024 to compare their performance. Among these, the AraT5v2-base-1024 model achieved the best accuracy, with a BLEU score of 0.1650 on the development set and 0.1746 on the test set.

pdf abs
CUFE at NADI 2024 shared task: Fine-Tuning Llama-3 To Translate From Arabic Dialects To Modern Standard Arabic
Michael Ibrahim

LLMs such as GPT-4 and LLaMA excel in multiple natural language processing tasks; however, LLMs face challenges in delivering satisfactory performance on low-resource languages due to limited availability of training data. In this paper, LLaMA-3 with 8 billion parameters is fine-tuned to translate among Egyptian, Emirati, Jordanian, Palestinian Arabic dialects, and Modern Standard Arabic (MSA). In the NADI 2024 Task on DA-MSA Machine Translation, the proposed method achieved a BLEU score of 21.44 when it was fine-tuned on the development dataset of the NADI 2024 Task on DA-MSA and a BLEU score of 16.09 when it was fine-tuned using the OSACT dataset.

pdf abs
StanceEval 2024: The First Arabic Stance Detection Shared Task
Nora Alturayeif | Hamzah Luqman | Zaid Alyafeai | Asma Yamani

Recently, there has been a growing interest in analyzing user-generated text to understand opinions expressed on social media. In NLP, this task is known as stance detection, where the goal is to predict whether the writer is in favor, against, or has no opinion on a given topic. Stance detection is crucial for applications such as sentiment analysis, opinion mining, and social media monitoring, as it helps in capturing the nuanced perspectives of users on various subjects. As part of the ArabicNLP 2024 program, we organized the first shared task on Arabic Stance Detection, StanceEval 2024. This initiative aimed to foster advancements in stance detection for the Arabic language, a relatively underrepresented area in Arabic NLP research. This overview paper provides a detailed description of the shared task, covering the dataset, the methodologies used by various teams, and a summary of the results from all participants. We received 28 unique team registrations, and during the testing phase, 16 teams submitted valid entries. The highest classification F-score obtained was 84.38.

pdf abs
Team_Zero at StanceEval2024: Frozen PLMs for Arabic Stance Detection
Omar Galal | Abdelrahman Kaseb

This research explores the effectiveness of using pre-trained language models (PLMs) as feature extractors for Arabic stance detection on social media, focusing on topics like women empowerment, COVID-19 vaccination, and digital transformation. By leveraging sentence transformers to extract embeddings and incorporating aggregation architectures on top of BERT, we aim to achieve high performance without the computational expense of fine-tuning. Our approach demonstrates significant resource and time savings while maintaining competitive performance, scoring an F1-score of 78.62 on the test set. This study highlights the potential of PLMs in enhancing stance detection in Arabic social media analysis, offering a resource-efficient alternative to traditional fine-tuning methods.

pdf abs
ANLP RG at StanceEval2024: Comparative Evaluation of Stance, Sentiment and Sarcasm Detection
Mezghani Amal | Rahma Boujelbane | Mariem Ellouze

As part of our study, we worked on three tasks: stance detection, sarcasm detection and sentiment analysis using fine-tuning techniques on BERT-based models. Fine-tuning parameters were carefully adjusted over multiple iterations to maximize model performance. The three tasks are essential in the field of natural language processing (NLP) and present unique challenges. Stance detection is a critical task aimed at identifying a writer’s stances or viewpoints in relation to a topic. Sarcasm detection seeks to spot sarcastic expressions, while sentiment analysis determines the attitude expressed in a text. After numerous experiments, we identified Arabert-twitter as the model offering the best performance for all three tasks. In particular, it achieves a macro F-score of 78.08% for stance detection, a macro F1-score of 59.51% for sarcasm detection and a macro F1-score of 64.57% for sentiment detection. Our source code is available at https://github.com/MezghaniAmal/Mawqif

pdf abs
dzStance at StanceEval2024: Arabic Stance Detection based on Sentence Transformers
Mohamed Lichouri | Khaled Lounnas | Ouaras Rafik | Mohamed ABi | Anis Guechtouli

This study compares Term Frequency-Inverse Document Frequency (TF-IDF) features with Sentence Transformers for detecting writers’ stances—favorable, opposing, or neutral—towards three significant topics: COVID-19 vaccine, digital transformation, and women empowerment. Through empirical evaluation, we demonstrate that Sentence Transformers outperform TF-IDF features across various experimental setups. Our team, dzStance, participated in a stance detection competition, achieving the 13th position (74.91%) among 15 teams in Women Empowerment, 10th (73.43%) in COVID Vaccine, and 12th (66.97%) in Digital Transformation. Overall, our team’s performance ranked 13th (71.77%) among all participants. Notably, our approach achieved promising F1-scores, highlighting its effectiveness in identifying writers’ stances on diverse topics. These results underscore the potential of Sentence Transformers to enhance stance detection models for addressing critical societal issues.

pdf abs
SMASH at StanceEval 2024: Prompt Engineering LLMs for Arabic Stance Detection
Youssef Hariri | Ibrahim Abu Farha

This paper presents the submission of Team SMASH of the University of Edinburgh to the Stance Detection in Arabic Language (StanceEval) 2024 shared task. We evaluated the performance of various BERT-based and large language models (LLMs). MARBERT demonstrates superior performance among the BERT-based models, achieving F1 and macro-F1 scores of 0.570 and 0.770, respectively. In contrast, the Command R model outperforms all models with the highest overall F1 score of 0.661 and macro F1 score of 0.820.

pdf abs
CUFE at StanceEval2024: Arabic Stance Detection with Fine-Tuned Llama-3 Model
Michael Ibrahim

In NLP, stance detection identifies a writer’s position or viewpoint on a particular topic or entity from their text and social media activity, which includes preferences and relationships. Researchers have been exploring techniques and approaches to develop effective stance detection systems. Large language models’ latest advancements offer a more effective solution to the stance detection problem. This paper proposes fine-tuning the newly released 8B-parameter Llama 3 model from Meta GenAI for Arabic text stance detection. The proposed method was ranked ninth in the StanceEval 2024 Task on stance detection in Arabic language achieving a Macro average F₁ score of 0.7647.

pdf abs
StanceCrafters at StanceEval2024: Multi-task Stance Detection using BERT Ensemble with Attention Based Aggregation
Ahmed Hasanaath | Aisha Alansari

Stance detection is a key NLP problem that classifies a writer’s viewpoint on a topic based on their writing. This paper outlines our approach for Stance Detection in Arabic Language Shared Task (StanceEval2024), focusing on attitudes towards the COVID-19 vaccine, digital transformation, and women’s empowerment. The proposed model uses parallel multi-task learning with two fine-tuned BERT-based models combined via an attention module. Results indicate this ensemble outperforms a single BERT model, demonstrating the benefits of using BERT architectures trained on diverse datasets. Specifically, Arabert-Twitterv2, trained on tweets, and Camel-Lab, trained on Modern Standard Arabic (MSA), Dialectal Arabic (DA), and Classical Arabic (CA), allowed us to leverage diverse Arabic dialects and styles.

pdf abs
MGKM at StanceEval2024 Fine-Tuning Large Language Models for Arabic Stance Detection
Mamoun Alghaslan | Khaled Almutairy

Social media platforms have become essential in daily life, enabling users to express their opinions and stances on various topics. Stance detection, which identifies the viewpoint expressed in text toward a target, has predominantly focused on English. MAWQIF is the pioneering Arabic dataset for target-specific stance detection, consisting of 4,121 tweets annotated with stance, sentiment, and sarcasm. The original dataset, benchmarked on four BERT-based models, achieved a best macro-F1 score of 78.89, indicating significant room for improvement. This study evaluates the effectiveness of three Large Language Models (LLMs) in detecting target-specific stances in MAWQIF. The LLMs assessed are ChatGPT-3.5-turbo, Meta-Llama-3-8B-Instruct, and Falcon-7B-Instruct. Performance was measured using both zero-shot and full fine-tuning approaches. Our findings demonstrate that fine-tuning substantially enhances the stance detection capabilities of LLMs in Arabic tweets. Notably, GPT-3.5-Turbo achieved the highest performance with a macro-F1 score of 82.93, underscoring the potential of fine-tuned LLMs for language-specific applications.

pdf abs
AlexUNLP-BH at StanceEval2024: Multiple Contrastive Losses Ensemble Strategy with Multi-Task Learning For Stance Detection in Arabic
Mohamed Badran | Mo’men Hamdy | Marwan Torki | Nagwa El-Makky

Stance detection, an evolving task in natural language processing, involves understanding a writer’s perspective on certain topics by analyzing his written text and interactions online, especially on social media platforms. In this paper, we outline our submission to the StanceEval task, leveraging the Mawqif dataset featured in The Second Arabic Natural Language Processing Conference. Our task is to detect writers’ stances (Favor, Against, or None) towards three selected topics (COVID-19 vaccine, digital transformation, and women empowerment). We present our approach primarily relying on a contrastive loss ensemble strategy. Our proposed approach achieved an F1-score of 0.8438 and ranked first in the StanceEval 2024 task. The code and checkpoints are available at https://github.com/MBadran2000/Mawqif.git

pdf abs
Rasid at StanceEval: Fine-tuning MARBERT for Arabic Stance Detection
Nouf AlShenaifi | Nourah Alangari | Hadeel Al-Negheimish

As social media usage continues to rise, the demand for systems to analyze opinions and sentiments expressed in textual data has become more critical. This paper presents our submission to the Stance Detection in Arabic Language Shared Task, in which we evaluated three models: the fine-tuned MARBERT Transformer, the fine-tuned AraBERT Transformer, and an Ensemble of Machine learning Classifiers. Our findings indicate that the MARBERT Transformer outperformed the other models in performance across all targets. In contrast, the Ensemble Classifier, which combines traditional machine learning techniques, demonstrated relatively lower effectiveness.

pdf abs
ISHFMG_TUN at StanceEval: Ensemble Method for Arabic Stance Evaluation System
Mustapha Jaballah

It is essential to understand the attitude of individuals towards specific topics in the Arabic language for tasks like sentiment analysis, opinion mining, and social media monitoring. However, the diversity of the linguistic characteristics of the Arabic language presents several challenges to accurately evaluating stance. In this study, we suggest an ensemble approach to tackle these challenges. Our method combines different classifiers using the voting method. Through multiple experiments, we demonstrate the effectiveness of our method, which achieves an F1-score of 0.7027. Our findings contribute to promoting NLP and offer valuable insights for applications like sentiment analysis, opinion mining, and social media monitoring.
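
The abstract above does not say which classifiers are combined or whether the voting is hard or soft. As a minimal sketch only, assuming hard (majority) voting over the three stance labels, the combination step could look like the following. The function name `majority_vote` and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by hard majority voting.

    `predictions` is a list with one entry per classifier; each entry is a
    list of predicted labels, one per input text. All lists must have the
    same length. Ties are broken in favor of the label that appears first
    among the classifiers' votes (an assumption made for this sketch).
    """
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("every classifier must label the same number of items")

    voted = []
    for i in range(n_items):
        votes = [clf_preds[i] for clf_preds in predictions]
        counts = Counter(votes)
        # Counter keeps insertion order, and max() returns the first maximal
        # key it encounters, so ties go to the earliest-voted label.
        voted.append(max(counts, key=counts.get))
    return voted


# Hypothetical outputs of three classifiers on three tweets.
clf_a = ["Favor", "Against", "None"]
clf_b = ["Favor", "None", "None"]
clf_c = ["Against", "Against", "Favor"]
ensemble = majority_vote([clf_a, clf_b, clf_c])
```

Soft voting, which averages class probabilities instead of counting labels, is an equally plausible reading of "the voting method".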

pdf abs
PICT at StanceEval2024: Stance Detection in Arabic using Ensemble of Large Language Models
Ishaan Shukla | Ankit Vaidya | Geetanjali Kale

This paper outlines our approach to the StanceEval 2024 Arabic Stance Evaluation shared task. The goal of the task was to identify the stance, one of three (Favor, Against or None), expressed in tweets on three topics, namely COVID-19 Vaccine, Digital Transformation and Women Empowerment. Our approach consists of fine-tuning BERT-based models efficiently for both Single-Task Learning and Multi-Task Learning, the details of which are discussed. Finally, an ensemble was implemented on the best-performing models to maximize overall performance. We achieved a macro F1 score of 78.02% in this shared task. Our codebase is available publicly.

pdf abs
TAO at StanceEval2024 Shared Task: Arabic Stance Detection using AraBERT
Anas Melhem | Osama Hamed | Thaer Sammar

In this paper, we present a high-performing model for Arabic stance detection on the STANCEEVAL2024 shared task, part of ARABICNLP2024. Our model leverages ARABERTV1, a pre-trained Arabic language model, within a single-task learning framework. We fine-tuned the model on stance detection data for three specific topics: COVID-19 vaccine, digital transformation, and women empowerment, extracted from the MAWQIF corpus. In terms of performance, our model achieves 73.30 macro-F1 score for women empowerment, 70.51 for digital transformation, and 64.55 for COVID-19 vaccine detection.

We present WojoodNER-2024, the second Arabic Named Entity Recognition (NER) Shared Task. In WojoodNER-2024, we focus on fine-grained Arabic NER. We provided participants with a new Arabic fine-grained NER dataset called Wojoodfine, annotated with subtypes of entities. WojoodNER-2024 encompassed three subtasks: (i) Closed-Track Flat Fine-Grained NER, (ii) Closed-Track Nested Fine-Grained NER, and (iii) an Open-Track NER for the Israeli War on Gaza. A total of 43 unique teams registered for this shared task. Five teams participated in the Flat Fine-Grained Subtask, among which two teams tackled the Nested Fine-Grained Subtask and one team participated in the Open-Track NER Subtask. The winning teams achieved F₁ scores of 91% and 92% in the Flat Fine-Grained and Nested Fine-Grained Subtasks, respectively. The sole team in the Open-Track Subtask achieved an F₁ score of 73.7%.

This paper presents our system “muNERa”, submitted to the WojoodNER 2024 shared task at the second ArabicNLP conference. We participated in two subtasks, the flat and nested fine-grained NER sub-tasks (1 and 2). muNERa achieved first place in the nested NER sub-task and second place in the flat NER sub-task. The system is based on the TANL framework (CITATION), using a sequence-to-sequence structured language translation approach to model both tasks. We utilize the pre-trained AraT5v2-base model as the base model for the TANL framework. The best-performing muNERa model achieves 91.07% and 90.26% for the F-1 scores on the test sets for the nested and flat subtasks, respectively.

pdf abs
Addax at WojoodNER 2024: Attention-Based Dual-Channel Neural Network for Arabic Named Entity Recognition
Issam Yahia | Houdaifa Atou | Ismail Berrada

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that focuses on extracting entities such as names of people, organizations, locations, and dates from text. Despite significant advancements due to deep learning and transformer architectures like BERT, NER still faces challenges, particularly in low-resource languages like Arabic. This paper presents a BERT-based NER system that utilizes a two-channel parallel hybrid neural network with an attention mechanism specifically designed for the NER Shared Task 2024. In the competition, our approach ranked second by scoring 90.13% in micro-F1 on the test set. The results demonstrate the effectiveness of combining advanced neural network architectures with contextualized word embeddings in improving NER performance for Arabic.

In this paper, we present our submission for the WojoodNER 2024 Shared Tasks addressing flat and nested sub-tasks (1, 2). We experiment with three different approaches. We train (i) an Arabic fine-tuned version of BLOOMZ-7b-mt, GEMMA-7b, and AraBERTv2 on a multi-label token classification task; (ii) two AraBERTv2 models, on main types and sub-types respectively; and (iii) one model for main types and four for the four sub-types. Based on the Wojood NER 2024 test set results, the three fine-tuned models performed similarly with AraBERTv2 favored (F1: Flat=.8780 Nested=.9040). The five-model approach performed slightly better (F1: Flat=.8782 Nested=.9043).

pdf abs
Bangor University at WojoodNER 2024: Advancing Arabic Named Entity Recognition with CAMeLBERT-Mix
Norah Alshammari

This paper describes the approach and results of Bangor University’s participation in the WojoodNER 2024 shared task, specifically for Subtask-1: Closed-Track Flat Fine-Grain NER. We present a system utilizing a transformer-based model called bert-base-arabic-camelbert-mix, fine-tuned on the Wojood-Fine corpus. A key enhancement to our approach involves adding a linear layer on top of the bert-base-arabic-camelbert-mix to classify each token into one of 51 different entity types and subtypes, as well as the ‘O’ label for non-entity tokens. This linear layer effectively maps the contextualized embeddings produced by BERT to the desired output labels, addressing the complex challenges of fine-grained Arabic NER. The system achieved competitive results in precision, recall, and F1 scores, thereby contributing significant insights into the application of transformers in Arabic NER tasks.
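
To illustrate the token-classification head described in the abstract above, the following is a minimal sketch in plain Python. It assumes a single linear map from each token's contextual embedding to one logit per label, followed by an argmax. The tiny dimensions, the three-label subset, and the names `linear_head` and `classify_tokens` are hypothetical choices for the sketch; the actual system's label set and weights are not given in the abstract.

```python
def linear_head(hidden, weight, bias):
    """Compute logits = weight @ hidden + bias for one token embedding."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


def classify_tokens(token_embeddings, weight, bias, labels):
    """Assign each token the label whose logit is largest."""
    predicted = []
    for hidden in token_embeddings:
        logits = linear_head(hidden, weight, bias)
        best = max(range(len(labels)), key=logits.__getitem__)
        predicted.append(labels[best])
    return predicted


# Toy setup: 2-dimensional "embeddings" and three labels instead of the
# full label inventory (entity types and subtypes plus 'O').
labels = ["O", "B-PERS", "B-ORG"]
weight = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one row per label
bias = [0.5, 0.0, 0.0]
embeddings = [[2.0, 0.0], [0.0, 3.0], [0.1, 0.1]]
tags = classify_tokens(embeddings, weight, bias, labels)
```

In a real system the embeddings would come from the fine-tuned encoder and the weights would be learned jointly with it; here they are fixed by hand so the mapping is easy to follow.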

This paper details our submission to the WojoodNER Shared Task 2024, leveraging in-context learning (ICL) with large language models for Arabic Named Entity Recognition. We utilized the Command R model to perform fine-grained NER on the Wojood-Fine corpus. Our primary approach achieved an F1 score of 0.737 and a recall of 0.756. Post-processing the generated predictions to correct format inconsistencies resulted in an increased recall of 0.759, and a similar F1 score of 0.735. A multi-level prompting method and aggregation of outputs resulted in a lower F1 score of 0.637. Our results demonstrate the potential of ICL for Arabic NER while highlighting challenges related to LLM output consistency.

pdf abs
mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search
Ahmed Abdou | Tasneem Mahmoud

Named Entity Recognition (NER) is a task in Natural Language Processing (NLP) that aims to identify and classify entities in text into predefined categories. However, when applied to Arabic data, NER encounters unique challenges stemming from the language’s rich morphological inflections, absence of capitalization cues, and spelling variants, where a single word can comprise multiple morphemes. In this paper, we introduce Arabic KNN-NER, our submission to the Wojood NER Shared Task 2024 (ArabicNLP 2024). We have participated in the shared sub-task 1 Flat NER. In this shared sub-task, we tackle fine-grained flat-entity recognition for Arabic text, where we identify a single main entity and possibly zero or multiple sub-entities for each word. Arabic KNN-NER augments the probability distribution of a fine-tuned model with another label probability distribution derived from performing a KNN search over the cached training data. Our submission achieved 91% on the test set on the WojoodFine dataset, placing Arabic KNN-NER on top of the leaderboard for the shared task.
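
The augmentation step described in the abstract above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes Euclidean distance over cached token representations, a softmax over negative distances to turn the k nearest neighbors into a label distribution, and a fixed interpolation weight `lam`. The function names, the temperature parameter, and the toy vectors are all hypothetical.

```python
import math


def knn_distribution(query, datastore, labels, k=3, temperature=1.0):
    """Build a label distribution from the k nearest cached training tokens.

    `datastore` is a list of (vector, label) pairs. Each neighbor is weighted
    by exp(-distance / temperature), then weights are normalized to sum to 1.
    """
    nearest = sorted(
        (math.dist(query, vec), lab) for vec, lab in datastore
    )[:k]
    weights = [math.exp(-dist / temperature) for dist, _ in nearest]
    total = sum(weights)
    probs = {lab: 0.0 for lab in labels}
    for weight, (_, lab) in zip(weights, nearest):
        probs[lab] += weight / total
    return probs


def interpolate(model_probs, knn_probs, lam=0.5):
    """Mix the two distributions: lam * kNN + (1 - lam) * model."""
    return {
        lab: lam * knn_probs[lab] + (1.0 - lam) * model_probs[lab]
        for lab in model_probs
    }


# Toy cached training representations (2-dimensional for readability).
labels = ["PERS", "ORG"]
datastore = [
    ((0.0, 0.0), "PERS"),
    ((0.0, 1.0), "PERS"),
    ((5.0, 5.0), "ORG"),
    ((6.0, 5.0), "ORG"),
]
model_probs = {"PERS": 0.4, "ORG": 0.6}  # hypothetical fine-tuned model output
knn_probs = knn_distribution((0.0, 0.5), datastore, labels, k=2)
final_probs = interpolate(model_probs, knn_probs, lam=0.5)
prediction = max(final_probs, key=final_probs.get)
```

In this toy case both nearest neighbors are labeled PERS, so the retrieved distribution overturns the model's slight preference for ORG. How strongly retrieval should count (the value of `lam`) would normally be tuned on development data.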