Imaan Mohammed Alkhanen


2025

Saudi-Alignment Benchmark: Assessing LLMs Alignment with Cultural Norms and Domain Knowledge in the Saudi Context
Manal Alhassoun | Imaan Mohammed Alkhanen | Nouf Alshalawi | Ibtehal Baazeem | Waleed Alsanie
Proceedings of The Third Arabic Natural Language Processing Conference

For effective use in specific countries, Large Language Models (LLMs) need a strong grasp of local culture and core knowledge to ensure socially appropriate, context-aware, and factually correct responses. Existing Arabic and Saudi benchmarks are limited, focusing mainly on dialects or lifestyle, with little attention to deeper cultural or domain-specific alignment grounded in authoritative sources. To address this gap and the challenge LLMs face with non-Western cultural nuance, this study introduces the Saudi-Alignment Benchmark. It consists of 874 manually curated questions across two core cultural dimensions: Saudi Cultural and Ethical Norms, and Saudi Domain Knowledge. These questions span multiple subcategories and use three question formats, each targeting a different assessment goal, with answers drawn from verified sources. Our evaluation reveals significant variance in LLM alignment. GPT-4 achieved the highest overall accuracy (83.3%), followed by ALLaM-7B (81.8%) and Llama-3.3-70B (81.6%), whereas Jais-30B exhibited a pronounced shortfall at 21.9%. Furthermore, multilingual LLMs excelled on cultural and ethical norms, whereas ALLaM-7B led in domain knowledge. With respect to question format, LLMs generally performed well on selected-response formats but showed weaker results on generative tasks, indicating that recognition-based benchmarks alone may overestimate cultural and contextual alignment. These findings highlight the need for tailored benchmarks and reveal LLMs’ limitations in achieving cultural grounding, particularly in underrepresented contexts like Saudi Arabia.