Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models

Blanca Calvo Figueras; Rodrigo Agerri

doi:10.18653/v1/2025.findings-emnlp.302

Abstract

The task of Critical Questions Generation (CQs-Gen) aims to foster critical thinking by enabling systems to generate questions that expose underlying assumptions and challenge the validity of argumentative reasoning structures. Despite growing interest in this area, progress has been hindered by the lack of suitable datasets and automatic evaluation standards. This paper presents a comprehensive approach to support the development and benchmarking of systems for this task. We construct the first large-scale dataset including ~5K manually annotated questions. We also investigate automatic evaluation methods and propose reference-based techniques as the strategy that best correlates with human judgments. Our zero-shot evaluation of 11 LLMs establishes a strong baseline while showcasing the difficulty of the task. Data and code plus a public leaderboard are provided to encourage further research, not only in terms of model performance, but also to explore the practical benefits of CQs-Gen for both automated reasoning and human critical thinking.

Anthology ID: 2025.findings-emnlp.302
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5635–5652
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.302/
DOI: 10.18653/v1/2025.findings-emnlp.302
Cite (ACL): Blanca Calvo Figueras and Rodrigo Agerri. 2025. Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5635–5652, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models (Calvo Figueras & Agerri, Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.302.pdf
Checklist: 2025.findings-emnlp.302.checklist.pdf
