AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models
Alhanoof Althnian | Norah A. Alzahrani | Shaykhah Z. Alsubaie | Eman Albilali | Ahmed Abdelali | Nouf M. Alotaibi | M Saiful Bari | Yazeed Alnumay | Abdulhamed Alothaimen | Maryam Saif | Shahad D. Alzaidi | Faisal Abdulrahman Mirza | Yousef Almushayqih | Mohammed Al Saleem | Ghadah Alabduljabbar | Abdulmohsen Al-Thubaity | Areeb Alowisheq | Nora Al-Twairesh
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rapid advancement of Large Language Models (LLMs) necessitates robust benchmarks. In this paper, we present AraEval, a pioneering and comprehensive evaluation suite specifically developed to assess the advanced knowledge, reasoning, truthfulness, and instruction-following capabilities of foundation models in the Arabic context. AraEval includes a diverse set of evaluation tasks that test various dimensions of knowledge and reasoning, with a total of 24,378 samples. These tasks cover areas such as linguistic understanding, factual recall, logical inference, commonsense reasoning, mathematical problem-solving, and domain-specific expertise, ensuring that the evaluation goes beyond basic language comprehension. It spans multiple domains of knowledge, such as science, history, religion, and literature, so that LLMs are tested on a broad spectrum of topics relevant to Arabic-speaking contexts. AraEval is designed to facilitate comparisons across different foundation models, enabling LLM developers and users to benchmark performance effectively. In addition, it provides diagnostic insights to identify specific areas where models excel or struggle, guiding further development. AraEval datasets can be found at https://huggingface.co/collections/humain-ai/araeval-datasets-687760e04b12a7afb429a4a0.
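Since the paper only points to the Hugging Face collection URL, the snippet below is a minimal sketch of how one of the released task datasets might be loaded with the `datasets` library; the repository id and field access shown are illustrative assumptions, and the actual dataset ids and schemas should be taken from the collection page.

```python
# Minimal sketch: loading one AraEval task from the humain-ai collection.
# NOTE: "humain-ai/araeval-example-task" is a hypothetical dataset id used for
# illustration only; see the collection page for the real repository names.
from datasets import load_dataset

dataset = load_dataset("humain-ai/araeval-example-task", split="test")

# Inspect a single evaluation sample; the field names depend on the task's schema.
example = dataset[0]
print(example)
```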