Saeed Almheiri
2026
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
Muhammad Dehan Al Kautsar | Saeed Almheiri | Momina Ahsan | Bilal Elbouardi | Younes Samih | Sarfraz Ahmad | Amr Keleg | Omar El Herraoui | Kareem Elzeky | Abed Alhakim Freihat | Mohamed Anwar | Zhuohan Xie | Junhong Liang | Mohammad Rustom Al Nasar | Preslav Nakov | Fajri Koto
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country’s respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We use the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects persists: models perform worse on all three tasks in the dialectal setup than in the MSA one.
Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
Saeed Almheiri | Bilal Elbouardi | Salsabila Zahirah Pranida | Irina Nikishina | Ashwath Rao B | Parameswari Krishnamurthy | Muhammad Cendekia Airlangga | Rifo Ahmad Genadi | Nguyen Phan Gia Bao | Amir Hossein Yari | Hawau Olamide Toyin | Nurdaulet Mukhituly | Mena Attia | Besher Hassan | Ahmad Fathan Hidayatullah | Tatsuki Kuribayashi | Haonan Li | Suma Bhat | Fajri Koto
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.
Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of complex real-world financial environments and of high-risk, context-sensitive multilingual financial misinformation detection (MFMD) tasks. In this work, we propose MFMDScen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMDScen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project is available at https://github.com/lzw108/FMD.
2025
Commonsense Reasoning in Arab Culture
Abdelrahman Sadallah | Junior Cedric Tonga | Khalid Almubarak | Saeed Almheiri | Farah Atif | Chatrine Qwaider | Karima Kadaoui | Sara Shatnawi | Yaser Alesh | Fajri Koto
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. It spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.
Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World
Saeed Almheiri | Rania Elbadry | Mena Attia | Chenxi Wang | Preslav Nakov | Timothy Baldwin | Fajri Koto
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning within the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning (ICL) and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average in multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesian and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapting LLMs to low-resource cultural settings.
Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
Saeed Almheiri | Yerulan Kongrat | Adrian Santosh | Ruslan Tasmukhanov | Josemaria Loza Vera | Muhammad Dehan Al Kautsar | Fajri Koto
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.
Co-authors
- Fajri Koto 5
- Muhammad Dehan Al Kautsar 2
- Mena Attia 2
- Bilal Elbouardi 2
- Preslav Nakov 2
- Zhuohan Xie 2
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Muhammad Cendekia Airlangga 1
- Mohammad Rustom Al Nasar 1
- Yaser Alesh 1
- Khalid Almubarak 1
- Abdulrazzaq Alnajjar 1
- Sophia Ananiadou 1
- Mohamed Anwar 1
- Farah Atif 1
- Ashwath Rao B 1
- Timothy Baldwin 1
- Nguyen Phan Gia Bao 1
- Suma Bhat 1
- Yupeng Cao 1
- Ming-Bin Chen 1
- Omar El Herraoui 1
- Rania Elbadry 1
- Kareem Elzeky 1
- Abed Alhakim Freihat 1
- Rifo Ahmad Genadi 1
- Polydoros Giannouris 1
- Besher Hassan 1
- Ahmad Fathan Hidayatullah 1
- Jimin Huang 1
- Yuechen Jiang 1
- Mohsinul Kabir 1
- Karima Kadaoui 1
- Amr Keleg 1
- Yerulan Kongrat 1
- Parameswari Krishnamurthy 1
- Tatsuki Kuribayashi 1
- Haonan Li 1
- Junhong Liang 1
- Zhiwei Liu 1
- Xue Liu 1
- Alejandro Lopez-Lira 1
- Nurdaulet Mukhituly 1
- Irina Nikishina 1
- Triantafillos Papadopoulos 1
- Xueqing Peng 1
- Salsabila Zahirah Pranida 1
- Lingfei Qian 1
- Chatrine Qwaider 1
- Abdelrahman Sadallah 1
- Younes Samih 1
- Adrian Santosh 1
- Sara Shatnawi 1
- Harry Stuart 1
- Md. Tariquzzaman 1
- Ruslan Tasmukhanov 1
- Paul Thompson 1
- Prayag Tiwari 1
- Junior Cedric Tonga 1
- Hawau Olamide Toyin 1
- Josemaria Loza Vera 1
- Chenxi Wang 1
- Yan Wang 1
- Chen Xu 1
- Ziyang Xu 1
- Amir Hossein Yari 1
- Ye Yuan 1
- Tianlei Zhu 1