TounsiBench: Benchmarking Large Language Models for Tunisian Arabic

Souha Ben Hassine, Asma Arrak, Marouene Addhoum, Steven R Wilson


Abstract
In this work, we introduce the first benchmark for evaluating the capabilities of large language models (LLMs) in understanding and generating responses in Tunisian Arabic. To achieve this, we construct a dataset of Tunisian Arabic instructions and prompt ten widely-used LLMs that claim to support Arabic. We then assess the LLM responses through both human and LLM-based evaluations across four criteria: quality, correctness, relevance, and dialectal adherence. We analyze the agreement and correlation between these judgments, select GPT-4o as our automated judge based on its high correlation with human ratings, and use it to generate a final leaderboard. Our error analysis reveals that most LLMs struggle to recognize Tunisian Arabic and to respond in it appropriately. To facilitate further research, we release our dataset, gold-standard human-written responses for all 744 instructions, and our evaluation framework, allowing others to benchmark their own models.
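The judge-selection step described above (picking the LLM judge whose ratings best track human ratings) can be illustrated with a minimal Python sketch. This is not the authors' released evaluation framework; the variable names, toy scores, and the choice of Spearman correlation are all assumptions made for illustration.

# Hypothetical sketch of judge selection: average each candidate judge's
# Spearman correlation with human ratings across the four criteria and
# keep the judge with the highest mean. Toy data only.
from scipy.stats import spearmanr

CRITERIA = ["quality", "correctness", "relevance", "dialectal_adherence"]

# Hypothetical 1-5 ratings, one per response, aligned by index.
human = {
    "quality": [4, 2, 5, 3],
    "correctness": [5, 1, 4, 3],
    "relevance": [4, 2, 5, 2],
    "dialectal_adherence": [3, 1, 4, 2],
}
judges = {
    "gpt-4o": {c: [4, 2, 4, 3] for c in CRITERIA},
    "other-judge": {c: [2, 4, 3, 5] for c in CRITERIA},
}

def mean_correlation(judge_ratings):
    """Mean Spearman correlation with human ratings over all criteria."""
    rhos = [spearmanr(human[c], judge_ratings[c])[0] for c in CRITERIA]
    return sum(rhos) / len(rhos)

best = max(judges, key=lambda name: mean_correlation(judges[name]))
print(f"Selected judge: {best} (mean rho = {mean_correlation(judges[best]):.2f})")

In practice one would also report inter-annotator agreement alongside these correlations, as the paper does, before trusting any single automated judge.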
Anthology ID:
2025.emnlp-main.1756
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
34615–34630
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1756/
Cite (ACL):
Souha Ben Hassine, Asma Arrak, Marouene Addhoum, and Steven R Wilson. 2025. TounsiBench: Benchmarking Large Language Models for Tunisian Arabic. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34615–34630, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
TounsiBench: Benchmarking Large Language Models for Tunisian Arabic (Hassine et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1756.pdf
Checklist:
2025.emnlp-main.1756.checklist.pdf