Leila Pishdad


2025

pdf bib
Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
Ziling Cheng | Meng Cao | Leila Pishdad | Yanshuai Cao | Jackie CK Cheung
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.

2020

pdf bib
How coherent are neural models of coherence?
Leila Pishdad | Federico Fancellu | Ran Zhang | Afsaneh Fazly
Proceedings of the 28th International Conference on Computational Linguistics

Despite the recent advances in coherence modelling, most such models including state-of-the-art neural ones, are evaluated on either contrived proxy tasks such as the standard order discrimination benchmark, or tasks that require special expert annotation. Moreover, most evaluations are conducted on small newswire corpora. To address these shortcomings, in this paper we propose four generic evaluation tasks that draw on different aspects of coherence at both the lexical and document levels, and can be applied to any corpora. In designing these tasks, we aim at capturing coherence-specific properties, such as the correct use of discourse connectives, lexical cohesion, as well as the overall temporal and causal consistency among events and participants in a story. Importantly, our proposed tasks either rely on automatically-generated data, or data annotated for other purposes, hence alleviating the need for annotation specifically targeted to the task of coherence modelling. We perform experiments with several existing state-of-the-art neural models of coherence on these tasks, across large corpora from different domains, including newswire, dialogue, as well as narrative and instructional text. Our findings point to a strong need for revisiting the common practices in the development and evaluation of coherence models.