ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training
Majd Hawasly, Tasnim Mohiuddin, Hamdy Mubarak, Sabri Boughorbel
Abstract
The quality of training data plays a critical role in the performance of large language models (LLMs). This is especially true for low-resource languages, where high-quality content is relatively scarce. Inspired by the success of FineWeb-Edu for English, we construct a native Arabic educational-quality dataset using similar methodological principles. We begin by sampling 1 million Arabic web documents from Common Crawl and labeling them into six quality classes (0–5) with the Qwen-2.5-72B-Instruct model, using a classification prompt adapted from FineWeb-Edu. These labeled examples are used to train a robust classifier capable of distinguishing educational content from general web text. We train a classification head on top of a multilingual 300M encoder model, then use this classifier to filter a large Arabic web corpus, discarding documents with low educational value. To evaluate the impact of this curation, we pretrain two bilingual English-Arabic 7B LLMs from scratch on 800 billion tokens, one using the filtered data and one using the unfiltered data, and compare their performance across a suite of benchmarks. Our results show a significant improvement when using the filtered educational dataset, validating the effectiveness of quality filtering as a component in a balanced data mixture for Arabic LLM development. This work addresses the scarcity of high-quality Arabic training data and offers a scalable methodology for curating educational-quality content in low-resource languages.
- Anthology ID:
- 2025.arabicnlp-main.36
- Volume:
- Proceedings of The Third Arabic Natural Language Processing Conference
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Kareem Darwish, Ahmed Ali, Ibrahim Abu Farha, Samia Touileb, Imed Zitouni, Ahmed Abdelali, Sharefah Al-Ghamdi, Sakhar Alkhereyf, Wajdi Zaghouani, Salam Khalifa, Badr AlKhamissi, Rawan Almatham, Injy Hamed, Zaid Alyafeai, Areeb Alowisheq, Go Inoue, Khalil Mrini, Waad Alshammari
- Venue:
- ArabicNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 436–447
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.36/
- Cite (ACL):
- Majd Hawasly, Tasnim Mohiuddin, Hamdy Mubarak, and Sabri Boughorbel. 2025. ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training. In Proceedings of The Third Arabic Natural Language Processing Conference, pages 436–447, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training (Hawasly et al., ArabicNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.arabicnlp-main.36.pdf
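
The filtering step described in the abstract (score each document with a classifier, then discard those with low educational value) might look roughly like the sketch below. Everything named here is an assumption for illustration: `Document`, `filter_by_quality`, the `min_score` cutoff of 3.0, and especially `toy_score`, which is a trivial word-count stand-in and not the paper's trained classifier. The abstract does not state which threshold the authors used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


def filter_by_quality(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],
    min_score: float,
) -> Iterator[Document]:
    """Yield only documents whose predicted quality score is at least min_score."""
    for doc in docs:
        if score_fn(doc.text) >= min_score:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier (NOT the paper's model).

    Returns a score clamped to the 0-5 label range, based on word count only.
    """
    return float(min(5, len(text.split()) // 3))


corpus = [
    Document("a", "buy now"),
    Document("b", "photosynthesis converts light energy into chemical energy in plants"),
]

# min_score=3.0 is an assumed cutoff chosen for this example.
kept = [d.doc_id for d in filter_by_quality(corpus, toy_score, min_score=3.0)]
print(kept)
```

In a real pipeline, `score_fn` would wrap the trained encoder plus classification head, and the generator form lets the filter stream over a corpus too large to hold in memory.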