Curriculum Learning based Hierarchical Scoring and Analysis Framework for Question Answering Task Evaluation

Qiong Wu, Tan Yue, Jianxin Liang, Zhen Li, Kai He, Shuai Zhao, Dongyan Zhao


Abstract
The rapid progress of large language models (LLMs) has increased the demand for efficient and reliable evaluation of question answering (QA) systems. Existing evaluation methods either rely on rule-based matching with shallow semantic understanding or adopt LLM-as-a-Judge approaches that incur high cost and latency while offering limited error interpretability. Accordingly, we propose HiEval, a curriculum-learning-based hierarchical framework for QA task evaluation that supports both quick scoring and fine-grained error analysis. HiEval contains a quick scoring model (HiEval-QS) that predicts three-level correctness labels, and an error analysis model (HiEval-EA) that classifies incorrect responses into five error types. HiEval incorporates a class-balanced focal loss to handle label imbalance, experience replay to prevent forgetting, and contrastive unlikelihood optimization to improve error discrimination. We also construct two large-scale human-annotated evaluation datasets collected from 50 QA-related datasets and covering 8 task types, and we release two challenging benchmarks. Extensive experiments show that HiEval achieves state-of-the-art performance on both quick scoring and error analysis tasks, outperforming all baseline methods, including GPT-5, while being approximately 25× faster.
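The abstract names a class-balanced focal loss but does not give its formulation. The sketch below is one common way to combine the two ideas, assuming effective-number class weights of the form (1 - beta) / (1 - beta^n_c) and the standard focal term (1 - p_t)^gamma; it is not taken from the paper. The function names, the label counts, and the values of beta and gamma are hypothetical and chosen only for illustration.

```python
import math


def class_balanced_weights(class_counts, beta=0.999):
    """Effective-number weights (1 - beta) / (1 - beta**n_c),
    rescaled so they sum to the number of classes.
    Assumes every count is a positive integer."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    scale = len(class_counts) / sum(raw)
    return [w * scale for w in raw]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cb_focal_loss(logits, target, weights, gamma=2.0):
    """Class-balanced focal loss for a single example:
    -w[target] * (1 - p_t)**gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical, imbalanced counts for three correctness levels.
counts = [8000, 1500, 500]
weights = class_balanced_weights(counts)

# Two examples with the same predicted probability for the true class:
# one belongs to the rarest class, the other to the most frequent class.
loss_rare = cb_focal_loss([0.2, 0.1, 0.3], target=2, weights=weights)
loss_common = cb_focal_loss([0.3, 0.1, 0.2], target=0, weights=weights)
```

In this sketch the rare-class example receives the larger loss even though both examples are equally hard, which is the intended effect of the class weights. Setting gamma to 0 and all weights to 1 reduces the function to ordinary cross-entropy.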
Anthology ID:
2026.findings-acl.332
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6672–6699
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.332/
Cite (ACL):
Qiong Wu, Tan Yue, Jianxin Liang, Zhen Li, Kai He, Shuai Zhao, and Dongyan Zhao. 2026. Curriculum Learning based Hierarchical Scoring and Analysis Framework for Question Answering Task Evaluation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6672–6699, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Curriculum Learning based Hierarchical Scoring and Analysis Framework for Question Answering Task Evaluation (Wu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.332.pdf
Checklist:
2026.findings-acl.332.checklist.pdf