AraVQA: Building a New Arabic Factoid Visual Question Answering Dataset from Wikipedia

Sultan Alrowili, Younes Samih, Abed Alhakim Freihat, Mathan Kumar Eswaran


Abstract
The development of large-scale Visual Question Answering (VQA) datasets has traditionally relied on resource-intensive manual annotation. In addition, most existing Arabic VQA datasets focus on culturally specific and dialect-aware domains. To address these limitations, we propose a new pipeline that uses Wikipedia template tags to extract the relevant information for each image; a Large Language Model (LLM) then uses this information to synthetically generate a new visual question answering dataset. Using this pipeline, we construct AraVQA, the most comprehensive Arabic Factoid Visual Question Answering dataset, containing more than 50,000 questions and covering over 20 primary subjects within Arabic general knowledge. Our detailed analysis shows that the dataset can serve as a post-training dataset to improve the performance of existing Visual Language Models (VLMs) on Arabic VQA tasks. Furthermore, we present a novel benchmark, derived from our dataset and validated through manual annotation, that is more challenging for Arabic VLMs than existing Arabic VQA datasets.
Anthology ID:
2026.acl-long.91
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2026–2042
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.91/
Cite (ACL):
Sultan Alrowili, Younes Samih, Abed Alhakim Freihat, and Mathan Kumar Eswaran. 2026. AraVQA: Building a New Arabic Factoid Visual Question Answering Dataset from Wikipedia. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2026–2042, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
AraVQA: Building a New Arabic Factoid Visual Question Answering Dataset from Wikipedia (Alrowili et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.91.pdf
Checklist:
 2026.acl-long.91.checklist.pdf