Marzieh Abdolmaleki

2026

Counter-Hypothesis Generation: Towards Evaluating How LLMs Reason about Alternatives
Marzieh Abdolmaleki | Aaron Maladry | Veronique Hoste | Els Lefever
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Reasoning about alternatives is a fundamental component of human cognition and argumentation, yet it remains unclear whether large language models (LLMs) can coherently generate and assess them. This paper introduces Counter-Hypothesis Generation (CHG), a novel task for evaluating how LLMs construct plausible hypotheses when contextual information changes. Inspired by open-domain commonsense reasoning, where models infer and compare multiple explanations, CHG bridges commonsense and counterfactual reasoning by requiring models to generate hypotheses that remain logically consistent with modified premises. We present a test set annotated by a human expert and complemented with counter-hypotheses generated by OpenAI-o3 and DeepSeek-r1. Experimental results reveal that even advanced reasoning models exhibit notable limitations in counter-hypothesis generation.

PMWP: A Benchmark for Math Word Problem Solving in Persian
Marzieh Abdolmaleki | Mehrnoush Shamsfard | Veronique Hoste | Els Lefever
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family

Mathematical reasoning captures fundamental aspects of human cognitive ability. Although recent advances in LLMs have led to substantial improvements in automated mathematical problem solving, most existing benchmarks remain focused on English. As a result, robust mathematical reasoning remains a challenging and insufficiently explored capability for underrepresented languages including Persian. To address this gap, we introduce PMWP, the first dataset of 15K elementary-level Persian math word problems that supports both supervised training and evaluation of reasoning models. By expanding mathematical reasoning resources beyond English, PMWP contributes to the development of multilingual AI systems with stronger reasoning capabilities. In this work, we conduct a systematic evaluation of the Persian math word problem solving capabilities of different state-of-the-art LLMs. Our results indicate that DeepSeek-V3 exhibits reduced language bias when problem texts are translated into English, while Gemini-2.5-Flash achieves the highest equation value accuracy (72.02%) in Persian. In addition, we investigate parameter-efficient adaptation for equation generation by applying LoRA-based fine-tuning to LLaMA-3-8B and Qwen-2.5-7B. Our results show that, following fine-tuning, these open-weight models achieve 91.65% and 92.53% exact equation match accuracy, respectively. Overall, our findings provide insights into the comparative strengths and limitations of proprietary and open-weight models for mathematical reasoning in Persian.
