Curriculum Learning based Hierarchical Scoring and Analysis Framework for Question Answering Task Evaluation
Qiong Wu, Tan Yue, Jianxin Liang, Zhen Li, Kai He, Shuai Zhao, Dongyan Zhao
Abstract
The rapid progress of large language models (LLMs) has increased the demand for efficient and reliable evaluation of question answering (QA) systems. Existing evaluation methods either rely on rule-based matching with shallow semantic understanding or adopt LLM-as-a-Judge approaches that incur high cost and latency while offering limited error interpretability. To address these limitations, we propose HiEval, a curriculum-learning-based hierarchical framework for QA task evaluation that supports both quick scoring and fine-grained error analysis. HiEval contains a quick scoring model (HiEval-QS) that predicts three-level correctness labels, and an error analysis model (HiEval-EA) that classifies incorrect responses into five error types. HiEval incorporates a class-balanced focal loss to handle label imbalance, experience replay to prevent forgetting, and contrastive unlikelihood optimization to improve error discrimination. We also construct two large-scale human-annotated evaluation datasets collected from 50 QA-related datasets and covering 8 task types, and we release two challenging benchmarks. Extensive experiments show that HiEval achieves state-of-the-art performance on both quick scoring and error analysis tasks, outperforming all baseline methods, including GPT-5, while being approximately 25× faster.
- Anthology ID:
- 2026.findings-acl.332
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6672–6699
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.332/
- Cite (ACL):
- Qiong Wu, Tan Yue, Jianxin Liang, Zhen Li, Kai He, Shuai Zhao, and Dongyan Zhao. 2026. Curriculum Learning based Hierarchical Scoring and Analysis Framework for Question Answering Task Evaluation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6672–6699, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Curriculum Learning based Hierarchical Scoring and Analysis Framework for Question Answering Task Evaluation (Wu et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.332.pdf