Bilal Elbouardi
2026
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
Muhammad Dehan Al Kautsar | Saeed Almheiri | Momina Ahsan | Bilal Elbouardi | Younes Samih | Sarfraz Ahmad | Amr Keleg | Omar El Herraoui | Kareem Elzeky | Abed Alhakim Freihat | Mohamed Anwar | Zhuohan Xie | Junhong Liang | Mohammad Rustom Al Nasar | Preslav Nakov | Fajri Koto
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country’s respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the dialectal setup, compared to the MSA one.
Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
Saeed Almheiri | Bilal Elbouardi | Salsabila Zahirah Pranida | Irina Nikishina | Ashwath Rao B | Parameswari Krishnamurthy | Muhammad Cendekia Airlangga | Rifo Ahmad Genadi | Nguyen Phan Gia Bao | Amir Hossein Yari | Hawau Olamide Toyin | Nurdaulet Mukhituly | Mena Attia | Besher Hassan | Ahmad Fathan Hidayatullah | Tatsuki Kuribayashi | Haonan Li | Suma Bhat | Fajri Koto
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, across all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.
Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling
Ivan Rodkin | Daniil Orel | Konstantin Smirnov | Arman Bolatov | Bilal Elbouardi | Besher Hassan | Yuri Kuratov | Aydar Bulatov | Preslav Nakov | Timothy Baldwin | Artem Shelmanov | Mikhail Burtsev
Findings of the Association for Computational Linguistics: ACL 2026
Reasoning is a core capability of large language models (LLMs), yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled one-dimensional cellular-automata (1dCA) framework that excludes memorization by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to solve a natural-language proxy of the proposed task reliably. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and that extending effective depth via recurrence, memory, or test-time compute improves results but remains bounded. Code is available on GitHub: https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT.
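The task described in the abstract above (infer a hidden local rule from a short state sequence, then chain it forward) can be illustrated with a small hand-written sketch. The abstract does not specify the state alphabet, neighborhood radius, or boundary handling, so everything here is an assumption made for illustration: binary cells, a radius-1 neighborhood, periodic boundaries, and Wolfram-style rule numbering. The function names (`rule_from_number`, `step`, `infer_rule`, `rollout`) are hypothetical and are not taken from the paper's repository; the lookup-table inference stands in for what the paper's neural models have to learn.

```python
from itertools import product


def rule_from_number(number):
    """Build a neighborhood -> next-cell lookup table from a rule number (0-255)."""
    # Assumption: Wolfram numbering, where neighborhood (l, c, r) selects bit l*4 + c*2 + r.
    return {
        (l, c, r): (number >> (l * 4 + c * 2 + r)) & 1
        for l, c, r in product((0, 1), repeat=3)
    }


def neighborhoods(state):
    """Yield the (left, center, right) triple of every cell, with periodic boundaries."""
    n = len(state)
    for i in range(n):
        yield state[(i - 1) % n], state[i], state[(i + 1) % n]


def step(state, rule):
    """Apply the local rule once to every cell."""
    return [rule[triple] for triple in neighborhoods(state)]


def infer_rule(states):
    """Recover the lookup table from consecutive observed states."""
    table = {}
    for prev, nxt in zip(states, states[1:]):
        for triple, out in zip(neighborhoods(prev), nxt):
            table[triple] = out
    return table  # may be partial if some neighborhoods were never observed


def rollout(state, rule, k):
    """Chain the rule k times to predict the state k steps ahead."""
    for _ in range(k):
        state = step(state, rule)
    return state


# The initial state is a cyclic de Bruijn sequence, so a single transition
# exposes all eight neighborhoods and the rule is fully identifiable.
hidden = rule_from_number(110)
s0 = [0, 0, 0, 1, 0, 1, 1, 1]
observed = [s0, step(s0, hidden)]

inferred = infer_rule(observed)
future = rollout(observed[-1], inferred, 3)
print(inferred == hidden, future)
```

With less informative observations the inferred table is only partial, and each extra rollout step compounds any error in it, which is one way to picture why accuracy can fall as the number of chained steps grows.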
2025
ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition
Ali Abouzeid | Bilal Elbouardi | Mohamed Maged | Shady Shehata
Proceedings of The Third Arabic Natural Language Processing Conference
Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters—90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.
Co-authors
- Saeed Almheiri 2
- Besher Hassan 2
- Fajri Koto 2
- Preslav Nakov 2
- Ali Abouzeid 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Muhammad Cendekia Airlangga 1
- Muhammad Dehan Al Kautsar 1
- Mohammad Rustom Al Nasar 1
- Mohamed Anwar 1
- Mena Attia 1
- Ashwath Rao B 1
- Timothy Baldwin 1
- Nguyen Phan Gia Bao 1
- Suma Bhat 1
- Arman Bolatov 1
- Aydar Bulatov 1
- Mikhail Burtsev 1
- Omar El Herraoui 1
- Kareem Elzeky 1
- Abed Alhakim Freihat 1
- Rifo Ahmad Genadi 1
- Ahmad Fathan Hidayatullah 1
- Amr Keleg 1
- Parameswari Krishnamurthy 1
- Yurii Kuratov 1
- Tatsuki Kuribayashi 1
- Haonan Li 1
- Junhong Liang 1
- Mohamed Maged 1
- Nurdaulet Mukhituly 1
- Irina Nikishina 1
- Daniil Orel 1
- Salsabila Zahirah Pranida 1
- Ivan Rodkin 1
- Younes Samih 1
- Shady Shehata 1
- Artem Shelmanov 1
- Konstantin Smirnov 1
- Hawau Olamide Toyin 1
- Zhuohan Xie 1
- Amir Hossein Yari 1