Zena Al-Khalili



2025

Saarland-Groningen at NADI 2025 Shared Task: Effective Dialectal Arabic Speech Processing under Data Constraints
Badr M. Abdullah | Yusser Al Ghussin | Zena Al-Khalili | Ömer Tarik Özyilmaz | Matias Valdenegro-Toro | Simon Ostermann | Dietrich Klakow
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Yunuo Liu | Dawei Zhu | Zena Al-Khalili | Dai Cheng | Yanjun Chen | Dietrich Klakow | Wei Zhang | Xiaoyu Shen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-booking pricing when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task to AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a range of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. These results highlight that, despite their general capabilities, today's LLMs remain unreliable for revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.
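
To make "interacting discounts" concrete, here is a minimal Python sketch of how two overlapping fare rules can compound in a bundled-tour calculation. The rule names, rates, and thresholds are invented for illustration and are not taken from the benchmark's policies.

def bundled_tour_price(base_price: float, is_student: bool, group_size: int) -> float:
    """Apply a customer-type rule, then a bundled-tour rule on the running total."""
    price = base_price
    if is_student:           # hypothetical customer-type rule: 20% off
        price *= 0.80
    if group_size >= 10:     # hypothetical group rule: 10% off the running total
        price *= 0.90
    return round(price, 2)

# The rules compound: 100 * 0.80 * 0.90 = 72.00, not the 70.00 that
# naive rule-by-rule subtraction (100 - 20 - 10) would give.
print(bundled_tour_price(100.0, is_student=True, group_size=12))  # 72.0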

Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics
Zena Al-Khalili | Nick Howell | Dietrich Klakow
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of the programs code-assisted LLMs generate in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.
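
As a hypothetical illustration of this distinction (the programs below are invented, not drawn from the paper's data), consider "what is the smallest positive integer divisible by both 4 and 6?". Both functions return the correct answer, so an execution-accuracy metric scores them identically, yet only the first grounds the computation in a mathematical concept; the second follows the exhaustive-search pattern the taxonomy flags.

import math  # math.lcm requires Python 3.9+

# Grounded in a concept: the answer *is* the least common multiple,
# so the program encodes why the result is correct.
def smallest_common_multiple_sound(a: int, b: int) -> int:
    return math.lcm(a, b)

# Exhaustive search: the same output is reached by trying candidates
# one by one, without encoding the underlying concept.
def smallest_common_multiple_search(a: int, b: int) -> int:
    n = 1
    while n % a != 0 or n % b != 0:
        n += 1
    return n

print(smallest_common_multiple_sound(4, 6))   # 12
print(smallest_common_multiple_search(4, 6))  # 12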