Salem Lahlou
2026
ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform
Salem Lahlou
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs, validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and the complete platform source code are released under the MIT license. Platform: https://arabic-dialect-hub.netlify.app.
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
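As a rough illustration of what a parameterized symbolic template with executable code might look like, the sketch below samples parameters from a seed, renders a question, and computes every intermediate step so that a model's chain of thought could be checked against it. The compound-interest topic, the function name `compound_interest_instance`, and the output layout are assumptions made for this example; they are not taken from the FinChain repository.

```python
import random


def compound_interest_instance(seed):
    """Hypothetical template: sample parameters, render a question, and
    compute each intermediate balance plus the final answer."""
    rng = random.Random(seed)  # seeded so every instance is reproducible
    principal = rng.randrange(1000, 10001, 500)
    rate_pct = rng.choice([2, 3, 4, 5])
    years = rng.randint(2, 5)

    question = (
        f"An account holds {principal} dollars and earns {rate_pct}% "
        f"interest compounded annually. What is the balance after "
        f"{years} years?"
    )

    # One verifiable step per year of compounding.
    steps = []
    balance = float(principal)
    for year in range(1, years + 1):
        balance = balance * (1 + rate_pct / 100)
        steps.append((f"balance after year {year}", round(balance, 2)))

    return {
        "params": {"principal": principal, "rate_pct": rate_pct, "years": years},
        "question": question,
        "steps": steps,
        "answer": round(balance, 2),
    }


if __name__ == "__main__":
    instance = compound_interest_instance(seed=7)
    print(instance["question"])
    for label, value in instance["steps"]:
        print(label, value)
```

Because the instance is a pure function of the seed, fresh question variants can be generated on demand, which is one plausible reading of the contamination-free generation the abstract describes.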
Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches
Hachem Madmoun | Salem Lahlou
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Eliciting cooperation in multi-agent LLM systems is critical for AI alignment. We investigate two approaches: direct communication and curriculum learning. In a 4-player Stag Hunt, a one-word "cheap talk" channel increases cooperation from 0% to 48.3%, demonstrating communication as a robust coordination mechanism. In contrast, we find that curriculum learning is highly sensitive to design choices: our pedagogical curriculum through progressively complex games reduced agent payoffs by 27.4% in an Iterated Public Goods Game with Punishment. Qualitative analysis reveals that curricula emphasizing defection-equilibrium games can induce "learned pessimism" in agents. These findings suggest that for coordination problems, simple communication protocols may be more reliable than experience-based training, and that curriculum design for social dilemmas requires careful attention to the strategic lessons embedded in game sequences.
Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models
Yanda Li | Yuhan Liu | Zirui Song | Yunchao Wei | Martin Takáč | Salem Lahlou
Findings of the Association for Computational Linguistics: ACL 2026
Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a temporal smoothing bias: transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leading to less specific audio-grounded outputs. We propose Temporal Contrastive Decoding (TCD), a training-free decoding method for unified LALMs that mitigates this effect at inference time. TCD constructs a temporally blurred slow-path view by smoothing the input waveform and re-encoding it, then contrasts next-token logits from the original and slow-path views. The contrastive signal is applied as a token-level logit update restricted to a small candidate set. A self-normalized stability score sets the blur window and update scale, and a step-wise gate based on uncertainty and audio reliance activates the update only when needed. Experiments on MMAU and AIR-Bench show consistent improvements on strong unified LALMs. We further conduct ablations and an architectural applicability study to analyze the contributions of key components and how TCD behaves across large audio-language model designs.
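The two core operations described in this abstract, blurring the waveform and contrasting next-token logits on a small candidate set, can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the moving-average blur, the top-`k` candidate rule, the fixed `alpha` scale, and the boolean `gate` are all hypothetical stand-ins for the stability score and uncertainty-based gate that the paper derives.

```python
def moving_average(wave, window):
    """Blur a 1-D waveform with a centered moving average.
    Samples near the edges are averaged over a shorter window."""
    half = window // 2
    blurred = []
    for i in range(len(wave)):
        lo = max(0, i - half)
        hi = min(len(wave), i + half + 1)
        blurred.append(sum(wave[lo:hi]) / (hi - lo))
    return blurred


def contrastive_update(logits_orig, logits_slow, k=2, alpha=1.0, gate=True):
    """Boost tokens that the original view supports more than the blurred
    view, but only inside the top-k candidates of the original logits.
    `k`, `alpha`, and `gate` are assumed knobs for illustration."""
    if not gate:  # gate closed: leave the logits untouched
        return list(logits_orig)
    order = sorted(range(len(logits_orig)),
                   key=lambda i: logits_orig[i], reverse=True)
    updated = list(logits_orig)
    for i in order[:k]:
        updated[i] += alpha * (logits_orig[i] - logits_slow[i])
    return updated


if __name__ == "__main__":
    # A single transient spike is spread out by the blur.
    print(moving_average([0, 0, 3, 0, 0], window=3))
    # Token 1 relies on the transient cue, so it gains after the update.
    print(contrastive_update([2.0, 1.5, 0.1], [2.5, 0.5, 0.1]))
```

In a real system the two logit vectors would come from running the model on the original and the re-encoded blurred audio; here they are plain lists so the update rule can be inspected in isolation.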
SD-E2: Semantic Exploration for Reasoning Under Token Budgets
Kshitij Mishra | Nils Lukas | Salem Lahlou
Findings of the Association for Computational Linguistics: EACL 2026
SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry | Sarfraz Ahmad | Ahmed Heakl | Dani Bouch | Momina Ahsan | Muhra AlMahri | Marwa Elsaid Khalil | Yuxia Wang | Salem Lahlou | Sophia Ananiadou | Veselin Stoyanov | Jimin Huang | Xueqing Peng | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event–cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event–cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
2025
PORT: Preference Optimization on Reasoning Traces
Salem Lahlou | Abdalgader Abubaker | Hakim Hacid
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Preference optimization methods have been successfully applied to improve not only the alignment of large language models (LLMs) with human values, but also specific natural language tasks such as summarization and stylistic continuations. This paper proposes using preference optimization methods on Chain-of-Thought steps in order to improve the mathematical reasoning performance of language models. While the chosen answers are obtained from datasets that include reasoning traces, we propose two complementary schemes for generating rejected answers: weak LLM prompting, and digit corruption. Our approach leads to increased accuracy on the GSM8K and AQuA-RAT mathematical reasoning benchmarks for Falcon2-11B and Mistral-7B. Additionally, the improved abilities transfer to non-mathematical tasks, including the ARC benchmark and symbolic reasoning challenges. For example, our method can lead to relative accuracy increases of up to 8.47% and 18.73% on the GSM8K and AQuA benchmarks respectively, without any extra annotations. This work suggests that the path towards better language reasoning abilities goes through spending resources on creating high-quality datasets of reasoning traces.
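Of the two schemes for producing rejected answers, digit corruption is simple enough to sketch directly. The version below is one plausible reading, assuming each digit in a correct reasoning step is independently replaced by a different digit with probability `rate`; the function name, the `rate` and `seed` parameters, and the pair layout are hypothetical and not taken from the paper.

```python
import random

DIGITS = "0123456789"


def corrupt_digits(step, rate=0.3, seed=0):
    """Return a copy of `step` in which each ASCII digit is swapped for a
    different digit with probability `rate`. Non-digit text is untouched."""
    rng = random.Random(seed)  # seeded for reproducible corruption
    out = []
    for ch in step:
        if ch in DIGITS and rng.random() < rate:
            out.append(rng.choice([d for d in DIGITS if d != ch]))
        else:
            out.append(ch)
    return "".join(out)


def make_preference_pair(prompt, chosen_step, rate=0.3, seed=0):
    """Pair a correct reasoning step with its digit-corrupted version."""
    return {
        "prompt": prompt,
        "chosen": chosen_step,
        "rejected": corrupt_digits(chosen_step, rate=rate, seed=seed),
    }


if __name__ == "__main__":
    pair = make_preference_pair(
        prompt="Tom has 12 apples and buys 30 more. How many does he have?",
        chosen_step="12 + 30 = 42, so Tom has 42 apples.",
        rate=1.0,
    )
    print(pair["chosen"])
    print(pair["rejected"])
```

Pairs in this prompt/chosen/rejected shape are the usual input to preference optimization objectives, and since the rejected side is derived mechanically from the chosen side, no extra annotation is needed, which matches the abstract's claim.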
Co-authors
- Sophia Ananiadou 2
- Rania Elbadry 2
- Jimin Huang 2
- Hachem Madmoun 2
- Preslav Nakov 2
- Xueqing Peng 2
- Veselin Stoyanov 2
- Yuxia Wang 2
- Zhuohan Xie 2
- Abdalgader Abubaker 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Muhra AlMahri 1
- Debopriyo Banerjee 1
- Dani Bouch 1
- Tanmoy Chakraborty 1
- Georgi Nenkov Georgiev 1
- Hakim Hacid 1
- Ahmed Heakl 1
- Marwa Elsaid Khalil 1
- Fajri Koto 1
- Ivan Koychev 1
- Haonan Li 1
- Yanda Li 1
- Yuhan Liu 1
- Nils Lukas 1
- Kshitij Mishra 1
- Daniil Orel 1
- Lingfei Qian 1
- Dhruv Sahnan 1
- Aaryamonvikram Singh 1
- Zirui Song 1
- Jinyan Su 1
- Martin Takáč 1
- Rushil Thareja 1
- Yunchao Wei 1
- Rui Xing 1
- Chen Xu 1
- Fan Zhang 1