Mounir Ghogho

2026

Candidate-Aware Retrieval and Reranking for Multiple-Choice Question Answering: Arabic as a Case Study
Yassine Bouziane | Youness Moukafih | Mounir Ghogho
Findings of the Association for Computational Linguistics: ACL 2026

Large language models (LLMs) have recently achieved impressive results on multiple-choice question answering (MCQA), with retrieval-augmented generation (RAG) emerging as an effective strategy for improving the performance of smaller models. However, existing RAG formulations face persistent challenges: retrieving too many passages often introduces noise, and even when relevant content is retrieved, models may still struggle with partially relevant or conflicting information. Moreover, while LLMs perform strongly on English benchmarks, their accuracy declines substantially on Arabic multi-task evaluations, revealing ongoing issues in cross-lingual transfer and domain adaptation. In this paper, we propose a novel approach, using Arabic as a representative case study, that jointly models the relevance of both the question and its candidate answers when selecting contextual passages. The method employs a lightweight reranker trained with a hybrid regression–triplet loss objective to identify passages that provide discriminative and reliable evidence. Extensive experiments across multiple model sizes and humanities domains show that our approach consistently outperforms both standard RAG baselines and reranker baselines, delivering two- to threefold improvements while remaining competitive with considerably larger models.
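
The abstract does not spell out the hybrid regression–triplet objective. One plausible reading, sketched below in plain Python, combines a mean-squared-error term on each passage's relevance score with a margin-based triplet term on paired scores. The names `hybrid_loss`, `alpha`, and `margin`, the default values, and the score-space form of the triplet term are assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence


def hybrid_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    pos_targets: Sequence[float],
    neg_targets: Sequence[float],
    alpha: float = 0.5,   # assumed weight on the regression term
    margin: float = 1.0,  # assumed triplet margin
) -> float:
    """Hypothetical hybrid loss over paired (helpful, unhelpful) passages."""
    n = len(pos_scores)
    if n == 0 or not (n == len(neg_scores) == len(pos_targets) == len(neg_targets)):
        raise ValueError("all inputs must be non-empty and of equal length")

    # Regression term: mean squared error between predicted and target relevance.
    squared_errors = [
        (s - t) ** 2
        for s, t in zip(
            list(pos_scores) + list(neg_scores),
            list(pos_targets) + list(neg_targets),
        )
    ]
    regression = sum(squared_errors) / len(squared_errors)

    # Triplet term: a helpful passage should outscore its paired
    # unhelpful passage by at least `margin`.
    triplet = sum(
        max(0.0, margin - p + q) for p, q in zip(pos_scores, neg_scores)
    ) / n

    return alpha * regression + (1.0 - alpha) * triplet
```

With these defaults, a pair scored exactly at its targets and separated by the margin incurs zero loss, while a pair ranked in the wrong order is penalized by both terms.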

MizanQA: A Benchmark for Multi-Answer Moroccan Legal QA
Adil Bahaj | Mounir Ghogho
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

We present MizanQA, a benchmark for assessing LLMs on Moroccan legal MCQs, many with multiple correct answers. Covering 1,776 expert-verified questions in Modern Standard Arabic enriched with Moroccan idioms, the dataset reflects influences from Maliki jurisprudence, customary law, and French legal traditions. Unlike single-answer settings, MizanQA features variable option counts, creating added difficulty. We evaluate multilingual and Arabic-centric models in zero-shot, native-Arabic prompts, measuring accuracy, a precision-penalized F1-like score, and calibration errors. Results show large performance gaps and miscalibration, particularly under stricter penalties. By scoping this benchmark to parametric knowledge only, we provide a baseline for future retrieval-augmented and rationale-focused setups.
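
The abstract does not define its precision-penalized F1-like score. As a stand-in, the sketch below computes standard set-based precision, recall, and F1 for a multi-answer question, alongside strict exact-match accuracy. The function name `set_scores` and the option labels are hypothetical, and the paper's actual metric may weight precision differently.

```python
from typing import Dict, Set


def set_scores(predicted: Set[str], gold: Set[str]) -> Dict[str, float]:
    """Score one multi-answer question by comparing option sets."""
    true_positives = len(predicted & gold)
    # Precision falls when the model selects extra, incorrect options.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall falls when the model misses correct options.
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "exact": 1.0 if predicted == gold else 0.0,  # strict accuracy
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# One correct option and one incorrect option, out of two gold options.
scores = set_scores({"A", "B"}, {"A", "C"})
```

Exact match gives no credit for partially correct selections, which is one reason a set-based score is reported alongside accuracy in multi-answer settings.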

2025

Evaluating LLMs Efficiency Using Successive Attempts on Binary-Outcome Tasks
Mohamed Amine El Yagouby | Mehdi Zekroum | Abdelkader Lahmadi | Mounir Ghogho | Olivier Festor
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

Evaluating Large Language Models (LLMs) using single-attempt metrics like Success Rate (SR) overlooks their capacity for iterative problem solving. In tasks with binary outcomes (success or failure), such as coding or planning, LLMs often benefit from multiple attempts. Existing multi-attempt metrics like pass@k and success@k account for eventual success but ignore how efficiently it is achieved, making them more costly. We propose a new evaluation method with Successive Multiple Attempts, where a maximum number of retries is fixed, and introduce our Success Efficiency (SE) metric, which captures both success and efficiency in a single value by rewarding earlier successes and penalizing delays. In tests on the HumanEval dataset across six LLMs, SE captures how quickly an LLM solves tasks, information that existing metrics do not provide. This work complements existing evaluation methods by measuring not only whether LLMs succeed but also how efficiently they do so.
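
To make the contrast concrete, the sketch below implements the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, next to a toy earliness-weighted score. The SE formula itself is not given in this abstract, so `earliness_score` is a hypothetical illustration of "rewarding earlier successes" and should not be read as the paper's definition.

```python
from math import comb
from typing import Optional


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def earliness_score(first_success: Optional[int], max_attempts: int) -> float:
    """Hypothetical score: 1.0 for success on attempt 1, decaying linearly.

    `first_success` is the 1-based attempt that succeeded, or None if
    every attempt failed.
    """
    if first_success is None:
        return 0.0
    if not 1 <= first_success <= max_attempts:
        raise ValueError("first_success must lie in 1..max_attempts")
    return (max_attempts - first_success + 1) / max_attempts


# Two tasks that both succeed within 4 attempts count the same under an
# "eventual success" view, but differ once earliness is rewarded.
fast = earliness_score(1, 4)
slow = earliness_score(4, 4)
```

Under this toy weighting, `fast` is 1.0 and `slow` is 0.25, whereas a metric that only records eventual success would score both tasks identically.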

Co-authors

Youness Moukafih 1

Mehdi Zekroum 1
