AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs

Qingqing Lyu; Linjuan Wu; Yongliang Shen; Hengwei Liu; Hao Li; Shengpei Jiang; Yin Zhang; Weiming Lu

Abstract

Despite the rapid progress of LLMs, their evaluation remains hindered by static, manually curated benchmarks with limited task coverage and poor adaptability to emerging domains. Existing automated approaches typically operate within fixed task schemas and often fail to autonomously discover new evaluation dimensions, limiting both scalability and effectiveness. To address these gaps, we propose AutoTaskEval, an automated framework that constructs domain-specific benchmarks directly from unstructured corpora. Using a refined Bloom’s Taxonomy, the framework systematically discovers tasks, enriches contextual grounding via iterative Socratic prompting, and generates diverse, progressively challenging evaluation instances. Applied to the complex and knowledge-intensive legal domain, AutoTaskEval uncovers a broader and more fine-grained task space than expert-curated benchmarks while producing high-quality instances that preserve established model-level evaluation trends. We further validate its robustness in a low-structure e-commerce review domain. Together, these results show that AutoTaskEval enables scalable, adaptive, and high-fidelity LLM assessment across domains and model families, advancing autonomous and capability-sensitive evaluation.

Anthology ID: 2026.acl-long.280
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6191–6223
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.280/
Cite (ACL): Qingqing Lyu, Linjuan Wu, Yongliang Shen, Hengwei Liu, Hao Li, Shengpei Jiang, Yin Zhang, and Weiming Lu. 2026. AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6191–6223, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs (Lyu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.280.pdf
Checklist: 2026.acl-long.280.checklist.pdf
