Fadi A. Zaraket
Also published as: Fadi Zaraket
2026
NeoAraBERT: A Modern Foundation Model for Arabic Embeddings with Diacritics-Aware Tokenization and POS-Targeted Masking
Chadi Abou Chakra | Hadi Khaled Hamoud | Osama Rakan Al Mraikhat | Qusai Abu Obaida | Mohamad Ballout | Fadi Zaraket
Findings of the Association for Computational Linguistics: ACL 2026
We present NeoAraBERT, a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pre-train NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies covering text normalization, light stemming, and diacritics-aware tokenization. We also performed more general ablation studies on POS-aware token masking and learning-rate scheduling. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a novel synonym-based task, "Muradif", that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants (MSA, dialectal, and mixed) rank first in 18 tasks, second in two, third in two, and fourth in one. They show strong performance on classical and modern standard Arabic, substantial margins of improvement (>7%) in two tasks, and a +2.75% improvement on average across all tasks. Our code and links to checkpoints for our model variants are available on our website: https://acr.ps/neoarabert.
Back-of-the-Book Index Automation for Arabic Documents
Nawal Haidar | Ahmad Kashmar | Fadi Zaraket
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Back-of-the-book indexes (BoBIs) are crucial for book readability. However, their manual creation is laborious and error-prone. In this paper, we introduce ArBoBIM to automate BoBI extraction and review processes for Arabic books. Given a book with a corresponding BoBI, ArBoBIM extracts BoBI terms, identifies their occurrences, and aligns those across several versions of the book. ArBoBIM first defines a pool of candidates for each term by leveraging noun phrases and named entities. It then uses several metrics, including exact matches, morpho-lexical similarity, and semantic similarity, to determine the best candidates. We empirically tuned the thresholds for ArBoBIM and achieved an F1-score of 0.94 (precision = 0.97, recall = 0.91). These results are significantly better than baseline results and top LLM-based results, with lower computational cost and no publishing-house IP risks. Additionally, over 500 books have been processed with ArBoBIM, resulting in the ArBoBIMap dataset, which contains books, their terms, occurrences, and related metadata, and is to be made publicly available. This dataset is used to train a model that decides whether a term, given its features, should be added to the back-of-the-book index of a specific book. The model achieves an F1-score of 0.91 (precision = 0.97, recall = 0.85).
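As a quick consistency check on the scores reported above, F1 is the harmonic mean of precision and recall. The sketch below is illustrative only: the helper name `f1_score` is an assumption introduced here and is not part of ArBoBIM. It recomputes both reported F1 values from their precision/recall pairs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the abstract above.
extraction_f1 = f1_score(0.97, 0.91)   # term extraction and alignment
classifier_f1 = f1_score(0.97, 0.85)   # index-term inclusion model

print(f"{extraction_f1:.2f} {classifier_f1:.2f}")  # prints "0.94 0.91"
```

Both values round to the figures stated in the abstract, so the reported triples are internally consistent.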
Arabic Citation Parsing using Part of Speech and Named Entity Recognition
Youssef Karout | Hadi Hamoud | Fadi A. Zaraket
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
This paper introduces an industry-level citation element extractor for Arabic text. Citation element extraction enables editorial task automation for publishing houses, creation of citation networks, and automatic citation analytics for impact analysis firms. Citation library tools help users manage their citations. However, for Arabic, these tools lack basic support for identifying and extracting citation elements. Consequently, researchers, editors, and reviewers manage Arabic citation tasks manually. We present a novel Arabic citation element dataset, use it to train a citation element extraction model, and use named entity recognition, morphological analysis, and keyword detection to improve the results for practical use. The paper reports industry-ready performance, with F1 scores ranging between .80 and .95 for the citation elements of interest.
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
Abdellah EL Mekki | Samar M. Magdy | Houdaifa Atou | Ruwa AbuHweidi | Baraah Qawasmeh | Omer Nacar | Thikra Al-hibiri | Razan Saadie | Hamzah A. Alsayadi | Nadia Ghezaiel Hammouda | Alshima Mohammed Alkhazimi | Aya Hamod | Al-Yas Yaqoob Al-Ghafri | Wesam El-Sayed | Asila Ismail al Sharji | Mohamad Ballout | Anas Belfathi | Karim Ghaddar | Serry Sibaee | Alaa Aoun | Aeej Mohammed Aseri | Lina Abureesh | Ahlam Bashiti | Majdal Yousef | Abdulaziz Hafiz | Yehdih Mohamed | Emira Hamedtou | Brakehe Emehah | Rahaf Alhamouri | Youssef Nafea | Aya El Aatar | Walid Al-Dhabyani | Emhemed S. Hamed | Sara Shatnawi | Fakhraddin Alwajih | Khalid Elkhidir | Ashwag Alasmari | Abdurrahman Gerrio | Omar Said Alshahri | AbdelRahim A. Elmadany | Ismail Berrada | Amir Azad Adli Al-kathiri | Fadi Zaraket | Mustafa Jarrar | Yahya Mohamed EL Hadj | Hassan Alhuzali | Muhammad Abdul-Mageed
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves both as a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria
From RAG to Agentic RAG for Faithful Islamic Question Answering
Gagan Bhatia | Hamdy Mubarak | Mustafa Jarrar | George Mikros | Fadi Zaraket | Mahmoud Alhirthani | Mutaz al-Khatib | Logan Cochrane | Kareem Mohamed Darwish | Rashid Yahiaoui | Firoj Alam
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and the ability to abstain when evidence is insufficient. To address this gap, we introduce IslamicFaithQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modeling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur’an retrieval corpus of ∼6k atomic verses (ayat). Building on these resources, we develop an agentic Qur’an-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic–English robustness even with a small model (i.e., Qwen3 4B). The datasets are publicly available (https://huggingface.co/datasets/QCRI/IslamicFaithQA).
2025
ImageEval 2025: The First Arabic Image Captioning Shared Task
Ahlam Bashiti | Alaa Aljabari | Hadi Hamoud | Md. Rafiul Biswas | Bilal Shalash | Mustafa Jarrar | Fadi Zaraket | George Mikros | Ehsaneddin Asgari | Wajdi Zaghouani
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
R-BPE: Improving BPE-Tokenizers with Token Reuse
Nancy Hamdan | Osama Rakan Al Mraikhat | Fadi A. Zaraket
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper presents R-BPE, a lightweight framework for adapting existing Byte-Pair Encoding (BPE) tokenizers to better support a specified target language. It reuses tokens from user-excluded languages and creates ID-based maps to resolve the new tokens of the chosen language. We evaluate R-BPE on Arabic as a target language. R-BPE reduced subword fertility by an average of 24.4% across the LLaMA 3.1 8B, Command R 35B, and Qwen 3 8B models. Applied to LLaMA 3.1 8B in continued pretraining mode, R-BPE yields a 7.33% reduction in training time. On the ArabicMMLU benchmark, the resulting model improved by 5.09 points on five in-domain topics and matched the original model’s overall performance. It also preserved performance on EnglishMMLU. R-BPE effectively leverages existing models’ tokenizers, embedding layers, and performance to better support target languages without incurring model size changes. We release an R-BPE implementation that is compatible with HuggingFace interfaces and thereby readily applicable to a wide range of existing models at https://acr.ps/1L9GPmL.
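To make the fertility figure above concrete: subword fertility is commonly measured as the average number of tokens a tokenizer emits per whitespace-delimited word, so lower is better. The sketch below is a minimal illustration under that assumption and is not the R-BPE implementation. The two toy tokenizers (`char_pairs`, `whole_word`) and the helper names are hypothetical stand-ins, and tokenizing word by word is a simplification of how real BPE tokenizers see text.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = 0
    tokens = 0
    for text in texts:
        for word in text.split():
            words += 1
            tokens += len(tokenize(word))
    if words == 0:
        raise ValueError("no words to measure")
    return tokens / words


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop in fertility, e.g. 0.25 means 25% fewer tokens per word."""
    return (before - after) / before


# Hypothetical toy tokenizers standing in for an original and an adapted one.
def char_pairs(word: str) -> List[str]:
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def whole_word(word: str) -> List[str]:
    return [word]


sample = ["abcd ef", "ghijkl"]
before = fertility(char_pairs, sample)   # 6 tokens over 3 words
after = fertility(whole_word, sample)    # 3 tokens over 3 words
reduction = relative_reduction(before, after)
```

With a real tokenizer, `tokenize` would wrap whatever encode call that tokenizer exposes; the measurement itself stays the same.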
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Artur Kiulian | Anton Polishko | Mykola Khandoga | Yevhen Kostiuk | Guillermo Gabrielli | Łukasz Gagała | Fadi Zaraket | Qusai Abu Obaida | Hrishikesh Garud | Wendy Wing Yee Mak | Dmytro Chaplynskyi | Selma Amor | Grigol Peradze
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)
In this paper, we propose a model-agnostic, cost-effective approach to developing bilingual base large language models (LLMs) that support English and any target language. The method includes vocabulary expansion, initialization of new embeddings, model training, and evaluation. We performed our experiments with three languages, each using a non-Latin script: Ukrainian, Arabic, and Georgian. Our approach demonstrates improved language performance while reducing computational costs. It mitigates the disproportionate penalization of underrepresented languages, promoting fairness and minimizing adverse phenomena such as code-switching and broken grammar. Additionally, we introduce new metrics to evaluate language quality, revealing that vocabulary size significantly impacts the quality of generated text.
2024
AREEj: Arabic Relation Extraction with Evidence
Osama Rakan Al Mraikhat | Hadi Hamoud | Fadi A. Zaraket
Proceedings of the Second Arabic Natural Language Processing Conference
Relational entity extraction is key in building knowledge graphs. A relational entity has a source, a tail, and a type. In this paper, we consider Arabic text and introduce evidence enrichment, which intuitively informs models for better predictions. Relational evidence is an expression in the text that explains how sources and targets relate. This paper augments the existing SREDFM relation extraction dataset with evidence annotations for its 2.9 million Arabic relations. We leverage the augmented dataset to build AREEj, a relation-extraction-with-evidence model for Arabic documents. The evidence augmentation model we constructed to complete the dataset achieved a .82 F1-score (.93 precision, .73 recall). The target AREEj model outperformed the SOTA mREBEL with a .72 F1-score (.78 precision, .66 recall).
DRU at WojoodNER 2024: A Multi-level Method Approach
Hadi Hamoud | Chadi Abou Chakra | Nancy Hamdan | Osama Rakan Al Mraikhat | Doha Albared | Fadi A. Zaraket
Proceedings of the Second Arabic Natural Language Processing Conference
In this paper, we present our submission for the WojoodNER 2024 Shared Tasks addressing the flat and nested sub-tasks (1, 2). We experiment with three different approaches. We train (i) Arabic fine-tuned versions of BLOOMZ-7b-mt, GEMMA-7b, and AraBERTv2 on a multi-label token classification task; (ii) two AraBERTv2 models, on main types and sub-types respectively; and (iii) one model for main types and four for the four sub-types. Based on the Wojood NER 2024 test set results, the three fine-tuned models performed similarly, with AraBERTv2 favored (F1: Flat=.8780 Nested=.9040). The five-model approach performed slightly better (F1: Flat=.8782 Nested=.9043).
DRU at WojoodNER 2024: ICL LLM for Arabic NER
Nancy Hamdan | Hadi Hamoud | Chadi Abou Chakra | Osama Rakan Al Mraikhat | Doha Albared | Fadi A. Zaraket
Proceedings of the Second Arabic Natural Language Processing Conference
This paper details our submission to the WojoodNER Shared Task 2024, leveraging in-context learning (ICL) with large language models for Arabic Named Entity Recognition. We used the Command R model to perform fine-grained NER on the Wojood-Fine corpus. Our primary approach achieved an F1 score of 0.737 and a recall of 0.756. Post-processing the generated predictions to correct format inconsistencies resulted in an increased recall of 0.759 and a similar F1 score of 0.735. A multi-level prompting method with aggregation of outputs resulted in a lower F1 score of 0.637. Our results demonstrate the potential of ICL for Arabic NER while highlighting challenges related to LLM output consistency.
2023
Arabic Topic Classification in the Generative and AutoML Era
Doha Albared | Hadi Hamoud | Fadi Zaraket
Proceedings of ArabicNLP 2023
Most recent models for Arabic topic classification fine-tuned existing pre-trained transformer models and targeted a limited number of categories. More recently, advances in automated ML and generative models have opened new possibilities for the task. While these approaches work for English, it remains an open question whether they perform well for low-resourced languages, Arabic in particular. This paper presents (i) ArBoNeClass, a novel Arabic dataset with an extended 14-topic class set covering modern books from the social sciences and humanities along with newspaper articles, and (ii) a set of topic classifiers built from it. We fine-tuned an open LLM to build ArGTClass. We compared its performance against the best models built with Vertex AI (Google), AutoML (H2O), and AutoTrain (HuggingFace). ArGTClass outperformed the Vertex AI and AutoML models and performed comparably to the AutoTrain model.
Nâbra: Syrian Arabic Dialects with Morphological Annotations
Amal Nayouf | Tymaa Hammouda | Mustafa Jarrar | Fadi Zaraket | Mohamad-Bassam Kurdy
Proceedings of ArabicNLP 2023
This paper presents Nâbra (نَبْرَة), a corpus of Syrian Arabic dialects with morphological annotations. A team of Syrian natives collected more than 6K sentences containing about 60K words from several sources, including social media posts, scripts of movies and series, lyrics of songs, and local proverbs, to build Nâbra. Nâbra covers several local Syrian dialects, including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda. A team of nine annotators annotated the 60K tokens with full morphological annotations across sentence contexts. We trained the annotators to follow methodological annotation guidelines to ensure unique morpheme annotations, and normalized the annotations. F1 and 𝜅 agreement scores ranged between 74% and 98% across features, showing the excellent quality of Nâbra annotations. The corpus is open-source and publicly available as part of the Currasat portal: https://sina.birzeit.edu/currasat.
DAVE: Differential Diagnostic Analysis Automation and Visualization from Clinical Notes
Hadi Hamoud | Fadi Zaraket | Chadi Abou Chakra | Mira Dankar
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
The Differential Analysis Visualizer for Electronic Medical Records (DAVE) is a tool that uses natural language processing and machine learning to visualize diagnostic algorithms in real time, supporting medical professionals in their clinical decision-making process.
2022
Curras + Baladi: Towards a Levantine Corpus
Karim Al-Haff | Mustafa Jarrar | Tymaa Hammouda | Fadi Zaraket
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents a twofold contribution: a full revision of the Palestinian morphologically annotated corpus (Curras), and a newly annotated Lebanese corpus (Baladi). Together, the two corpora can be used as a more general Levantine corpus. Baladi consists of around 9.6K morphologically annotated tokens. Each token was manually annotated with several morphological features, using LDC’s SAMA lemmas and tags. The inter-annotator evaluation on most features yields 78.5% Kappa and 90.1% F1-score. Curras was revised by refining all annotations for accuracy, normalizing and unifying POS tags, and linking with SAMA lemmas. This revision was also important to ensure that both corpora are compatible and can help bridge the nuanced linguistic gaps between the two highly mutually intelligible dialects. Both corpora are publicly available through a web portal.
2017
Morphology-based Entity and Relational Entity Extraction Framework for Arabic
Amin Jaber | Fadi A. Zaraket
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]
2012
Co-authors
- Hadi Hamoud 8
- Mustafa Jarrar 5
- Osama Rakan Al Mraikhat 5
- Chadi Abou Chakra 4
- Doha Albared 3
- Nancy Hamdan 3
- Mohamad Ballout 2
- Ahlam Bashiti 2
- Tymaa Hammouda 2
- George Mikros 2
- Qusai Abu Obaida 2
- Muhammad Abdul-Mageed 1
- Ruwa AbuHweidi 1
- Lina Abureesh 1
- Walid Al-Dhabyani 1
- Al-Yas Yaqoob Al-Ghafri 1
- Karim Al-Haff 1
- Thikra Al-hibiri 1
- Amir Azad Adli Al-kathiri 1
- Firoj Alam 1
- Ashwag Alasmari 1
- Rahaf Alhamouri 1
- Mahmoud Alhirthani 1
- Hassan Alhuzali 1
- Alaa Aljabari 1
- Alshima Mohammed Alkhazimi 1
- Hamzah A. Alsayadi 1
- Omar Said Alshahri 1
- Fakhraddin Alwajih 1
- Selma Amor 1
- Alaa Aoun 1
- Aeej Mohammed Aseri 1
- Ehsaneddin Asgari 1
- Houdaifa Atou 1
- Anas Belfathi 1
- Ismail Berrada 1
- Gagan Bhatia 1
- Md. Rafiul Biswas 1
- Dmytro Chaplynskyi 1
- Logan Cochrane 1
- Mira Dankar 1
- Kareem Mohamed Darwish 1
- Yahya Mohamed EL Hadj 1
- Abdellah El Mekki 1
- Aya El aatar 1
- Wesam El-Sayed 1
- Khalid Elkhidir 1
- AbdelRahim A. Elmadany 1
- Brakehe Emehah 1
- Guillermo Gabrielli 1
- Hrishikesh Garud 1
- Abdurrahman Gerrio 1
- Karim Ghaddar 1
- Łukasz Gągała 1
- Abdulaziz Hafiz 1
- Nawal Haidar 1
- Emhemed S. Hamed 1
- Emira Hamedtou 1
- Nadia Ghezaiel Hammouda 1
- Aya Hamod 1
- Amin Jaber 1
- Youssef Karout 1
- Ahmad Kashmar 1
- Mykola Khandoga 1
- Artur Kiulian 1
- Yevhen Kostiuk 1
- Mohamad-Bassam Kurdy 1
- Samar Mohamed Magdy 1
- Jad Makhlouta 1
- Yehdih Mohamed 1
- Hamdy Mubarak 1
- Omer Nacar 1
- Youssef Nafea 1
- Amal Nayouf 1
- Grigol Peradze 1
- Anton Polishko 1
- Baraah Qawasmeh 1
- Razan Saadie 1
- Bilal Shalash 1
- Sara Shatnawi 1
- Serry Sibaee 1
- Wendy Wing Yee Mak 1
- Rashid Yahiaoui 1
- Majdal Yousef 1
- Wajdi Zaghouani 1
- Asila Ismail al Sharji 1
- Mutaz al-Khatib 1