ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training

Majd Hawasly, Tasnim Mohiuddin, Hamdy Mubarak, Sabri Boughorbel


Abstract
The quality of training data plays a critical role in the performance of large language models (LLMs). This is especially true for low-resource languages, where high-quality content is relatively scarce. Inspired by the success of FineWeb-Edu for English, we construct a native Arabic educational-quality dataset using similar methodological principles. We begin by sampling 1 million Arabic web documents from Common Crawl and labeling them into six quality classes (0–5) with the Qwen-2.5-72B-Instruct model, using a classification prompt adapted from FineWeb-Edu. These labeled examples are used to train a robust classifier capable of distinguishing educational content from general web text. We train a classification head on top of a multilingual 300M-parameter encoder model, then use this classifier to filter a large Arabic web corpus, discarding documents with low educational value. To evaluate the impact of this curation, we pretrain from scratch two bilingual English-Arabic 7B LLMs on 800 billion tokens, one using the filtered data and one using the unfiltered data, and compare their performance across a suite of benchmarks. Our results show a significant improvement when using the filtered educational dataset, validating the effectiveness of quality filtering as a component in a balanced data mixture for Arabic LLM development. This work addresses the scarcity of high-quality Arabic training data and offers a scalable methodology for curating educational-quality content in low-resource languages.
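The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Document` type, the `filter_educational` function, the keyword-based `toy_score` stand-in for the trained classifier, and the `min_score=3` cutoff are all assumptions made for illustration. The abstract only states that documents are scored in classes 0–5 and that low-scoring documents are discarded; it does not give the threshold used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Document:
    """Hypothetical container for one web document."""
    url: str
    text: str


def filter_educational(
    docs: Iterable[Document],
    score_fn: Callable[[str], int],
    min_score: int = 3,  # assumed cutoff; the abstract does not state one
) -> Iterator[Document]:
    """Yield only documents whose educational score is at least min_score."""
    for doc in docs:
        score = score_fn(doc.text)
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 range")
        if score >= min_score:
            yield doc


def toy_score(text: str) -> int:
    """Placeholder scorer (NOT the paper's classifier): counts keyword hits."""
    keywords = ("درس", "شرح", "تعريف")  # lesson, explanation, definition
    return min(5, sum(2 for k in keywords if k in text))


docs = [
    Document(url="a", text="شرح درس الرياضيات"),    # "explanation of the maths lesson"
    Document(url="b", text="اشتر الآن بخصم كبير"),  # "buy now at a big discount"
]
kept = list(filter_educational(docs, toy_score))
print([d.url for d in kept])
```

In a real pipeline, `score_fn` would wrap the trained classifier (an encoder with a classification head, per the abstract) and the function would be applied in streaming fashion over the web corpus.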
Anthology ID:
2025.arabicnlp-main.36
Volume:
Proceedings of The Third Arabic Natural Language Processing Conference
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Kareem Darwish, Ahmed Ali, Ibrahim Abu Farha, Samia Touileb, Imed Zitouni, Ahmed Abdelali, Sharefah Al-Ghamdi, Sakhar Alkhereyf, Wajdi Zaghouani, Salam Khalifa, Badr AlKhamissi, Rawan Almatham, Injy Hamed, Zaid Alyafeai, Areeb Alowisheq, Go Inoue, Khalil Mrini, Waad Alshammari
Venue:
ArabicNLP
Publisher:
Association for Computational Linguistics
Pages:
436–447
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.36/
Cite (ACL):
Majd Hawasly, Tasnim Mohiuddin, Hamdy Mubarak, and Sabri Boughorbel. 2025. ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training. In Proceedings of The Third Arabic Natural Language Processing Conference, pages 436–447, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training (Hawasly et al., ArabicNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.36.pdf