@inproceedings{aboukozzana-etal-2025-amcrawl,
title = "{AMC}rawl: An {A}rabic Web-Scale Dataset of Interleaved Image-Text Documents and Image-Text Pairs",
author = "Aboukozzana, Shahad and
Khan, Muhammad Kamran J and
Ali, Ahmed",
editor = "Darwish, Kareem and
Ali, Ahmed and
Abu Farha, Ibrahim and
Touileb, Samia and
Zitouni, Imed and
Abdelali, Ahmed and
Al-Ghamdi, Sharefah and
Alkhereyf, Sakhar and
Zaghouani, Wajdi and
Khalifa, Salam and
AlKhamissi, Badr and
Almatham, Rawan and
Hamed, Injy and
Alyafeai, Zaid and
Alowisheq, Areeb and
Inoue, Go and
Mrini, Khalil and
Alshammari, Waad",
booktitle = "Proceedings of The Third Arabic Natural Language Processing Conference",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.arabicnlp-main.37/",
pages = "448--465",
ISBN = "979-8-89176-352-4",
    abstract = "In this paper, we present the Arabic Multimodal Crawl (AMCrawl), to our knowledge the first natively sourced Arabic multimodal dataset, derived from the Common Crawl corpus and rigorously filtered for quality and safety. Image-text pair datasets are the standard choice for pretraining multimodal large language models. However, they are often derived from image alt-text metadata, which is typically brief and context-poor, disconnecting images from their broader meaning. Although significant advances have been made in building interleaved image-text datasets for English, such as the OBELICS dataset, a substantial gap remains for native Arabic content. Our processing covered 8.6 million Arabic web pages, yielding 5.8 million associated images and 1.3 billion text tokens. The final dataset includes interleaved image-text documents and question-answer pairs, featuring 2.8 million high-quality interleaved documents and 5 million QA pairs. Alongside the dataset, we release the complete pipeline and code, ensuring reproducibility and encouraging further research and development. To demonstrate the effectiveness of AMCrawl, we introduce a publicly available native Arabic vision-language model with 13 billion parameters. This model achieves competitive results when benchmarked against publicly available datasets. AMCrawl bridges a critical gap in Arabic multimodal resources, providing a robust foundation for developing Arabic multimodal large language models and fostering advancements in this underrepresented area. Code: github.com/shahad-aboukozzana/AMCrawl"
}