Bloom-Eval: A Hierarchical Evaluation Benchmark for Automatic Survey Generation Based on Bloom’s Taxonomy

Fei Zhang; Zhe Zhao; HaiBin Wen; Tianshuo Wei; Zaixi Zhang; Chao Yang; Ye Wei

Abstract

The rapid advance of automatic survey generation (ASG) has created a critical evaluation challenge. Existing evaluation methods oversimplify the cognitive dimensions they assess and are methodologically unreliable, primarily because they over-rely on the “LLM-as-a-Judge” approach. To bridge this gap, we establish Bloom-Eval, a six-tiered benchmark based on Bloom’s Taxonomy that evaluates ASG systems reliably by prioritizing deterministic algorithms and introducing our GRADE approach for abstract abilities. Furthermore, we construct a large-scale, cross-disciplinary dataset of over 3,000 high-quality papers. Our empirical study on this benchmark reveals that while leading ASG systems are proficient at organizing format, they remain inadequate at integrating knowledge. This work aims to redefine ASG evaluation standards, shifting the research focus from mimicking surface structure to deepening intellectual content. Our method provides the ASG field with a systematic, reproducible, and theoretically grounded benchmark to guide future research.

Anthology ID: 2026.acl-long.1315
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 28512–28544
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1315/
Cite (ACL): Fei Zhang, Zhe Zhao, HaiBin Wen, Tianshuo Wei, Zaixi Zhang, Chao Yang, and Ye Wei. 2026. Bloom-Eval: A Hierarchical Evaluation Benchmark for Automatic Survey Generation Based on Bloom’s Taxonomy. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28512–28544, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Bloom-Eval: A Hierarchical Evaluation Benchmark for Automatic Survey Generation Based on Bloom’s Taxonomy (Zhang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1315.pdf
Checklist: 2026.acl-long.1315.checklist.pdf
