Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

Jamshid Mozafari, Bhawna Piryani, Adam Jatowt


Abstract
Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores), a novel method that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS on four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC) and show that it consistently outperforms baselines. Q-DAPS is also robust to hyperparameter variations and question types. Extensive ablation studies show that it remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations confirm strong alignment between Q-DAPS’s difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
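The core idea stated in the abstract, entropy over the plausibility scores of candidate answers, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the scores are assumed to be non-negative numbers normalized by their sum, and reading higher entropy as higher difficulty is an assumption, since the abstract does not specify the normalization or the direction of the mapping.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in nats) of plausibility scores after sum-normalization.

    Assumption: scores are non-negative and are turned into a distribution by
    dividing by their sum. The paper may normalize differently.
    """
    if not scores or any(s < 0 for s in scores):
        raise ValueError("scores must be a non-empty list of non-negative numbers")
    total = sum(scores)
    if total == 0:
        raise ValueError("at least one score must be positive")
    probs = [s / total for s in scores]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_difficulty(scores):
    """Entropy divided by its maximum, log(n), giving a value in [0, 1].

    Assumption: values near 1 (many equally plausible candidates) are read as
    harder questions; values near 0 (one dominant candidate) as easier ones.
    """
    n = len(scores)
    if n < 2:
        return 0.0
    return plausibility_entropy(scores) / math.log(n)


if __name__ == "__main__":
    # Hypothetical plausibility scores for four candidate answers each.
    print(normalized_difficulty([0.9, 0.05, 0.03, 0.02]))  # one dominant candidate
    print(normalized_difficulty([0.3, 0.25, 0.25, 0.2]))   # near-uniform candidates
```

Dividing by log(n) is a design choice of this sketch that makes scores comparable across questions with different numbers of candidate answers.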
Anthology ID:
2026.acl-long.510
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11124–11151
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.510/
Cite (ACL):
Jamshid Mozafari, Bhawna Piryani, and Adam Jatowt. 2026. Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11124–11151, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring (Mozafari et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.510.pdf
Checklist:
 2026.acl-long.510.checklist.pdf