Ali Khoramfar


2026

While Large Language Models (LLMs) achieve near-human performance on standard benchmarks, their capabilities often fail to generalize to complex, real-world problems. To bridge this gap, we introduce DeepQuestion, a scalable, automated framework that systematically elevates the cognitive complexity of existing datasets through controlled task transformations grounded in explicit cognitive hierarchies. Building on Bloom’s taxonomy, DeepQuestion generates (1) scenario-based problems that test the application of knowledge in noisy, realistic contexts, and (2) instruction-based prompts that require models to create new questions from a given solution path, assessing synthesis and evaluation skills. Our extensive evaluation of ten leading open-source and proprietary models, covering both general-purpose and reasoning LLMs, reveals a stark performance decline across evaluation settings—with accuracy dropping by up to 70%—as tasks ascend the cognitive hierarchy. These findings underscore that current benchmarks overestimate true reasoning ability and highlight the critical need for cognitively diverse evaluations to guide future LLM development.
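To make the two transformation types concrete, here is a minimal sketch of what such prompt rewrites might look like. This is purely illustrative: the function names, template wording, and example inputs are assumptions, not the paper's actual implementation or prompts.

```python
def to_scenario_prompt(question: str, scenario: str) -> str:
    """Transformation (1): embed a source question in a noisy,
    realistic scenario so the model must apply knowledge in context.
    (Illustrative template only, not from the paper.)"""
    return (
        f"Scenario: {scenario}\n"
        f"Within this situation, answer the following: {question}"
    )


def to_instruction_prompt(solution_path: str) -> str:
    """Transformation (2): given a worked solution path, ask the model
    to author a new question for which that path is the correct solution,
    probing synthesis and evaluation skills.
    (Illustrative template only, not from the paper.)"""
    return (
        "Below is a worked solution path:\n"
        f"{solution_path}\n"
        "Write a new, self-contained question for which this exact "
        "reasoning is the correct solution."
    )


if __name__ == "__main__":
    q = "What is the derivative of x**2?"
    print(to_scenario_prompt(
        q,
        "A drone logs altitude h(t) = t**2 alongside wind and battery data.",
    ))
    print(to_instruction_prompt("Apply the power rule: d/dx x**2 = 2x."))
```

The key design point the abstract implies is that both transforms operate on existing benchmark items, so difficulty can be raised systematically without authoring new datasets from scratch.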

2025

We present PerMed-MM, the first multimodal benchmark for Persian medical question answering. The dataset comprises 733 expert-authored multiple-choice questions from Iranian National Medical Board Exams, each paired with one to five clinically relevant images, spanning 46 medical specialties and diverse visual modalities. We evaluate five open-source and five proprietary vision-language models and find that reasoning supervision and domain-specific fine-tuning yield performance gains. Our cross-lingual analysis reveals significant unpredictability in translation-based pipelines, motivating the need for benchmarks that support direct, native-language evaluation. Additionally, domain- and modality-level analysis uncovers meaningful variation in model behavior that aggregate metrics often mask. PerMed-MM is publicly available on Hugging Face.