Mugariya Farooq
2025
3LM: Bridging Arabic, STEM, and Code through Benchmarking
Basma El Amel Boussaha | Leen Al Qadi | Mugariya Farooq | Shaikha Alsuwaidi | Giulia Campesan | Ahmed Alzubaidi | Mohammed Alyafeai | Hakim Hacid
Proceedings of The Third Arabic Natural Language Processing Conference
Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in areas like STEM and coding domains that are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.
2023
AlGhafa Evaluation Benchmark for Arabic Language Models
Ebtesam Almazrouei | Ruxandra Cojocaru | Michele Baldo | Quentin Malartic | Hamza Alobeidli | Daniele Mazzotta | Guilherme Penedo | Giulia Campesan | Mugariya Farooq | Maitha Alhammadi | Julien Launay | Badreddine Noune
Proceedings of ArabicNLP 2023
Recent advances in the space of Arabic large language models have opened up a wealth of potential practical applications. With optimal training strategies, large-scale data acquisition, and continuously increasing NLP resources, the Arabic LLM landscape has improved in a very short span of time, despite being plagued by training data scarcity and limited evaluation resources compared to English. Contributing to this ever-growing field, we introduce AlGhafa, a new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a new suite of models, including a 14-billion-parameter model, the largest monolingual Arabic decoder-only model to date. We use a collection of publicly available datasets, as well as a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.