Ye Wei

2026

The escalating demand for comprehensive literature surveys in rapidly evolving research areas makes manual writing increasingly impractical, underscoring the necessity of automation. Large Language Models (LLMs) provide a promising foundation for this task, yet guiding them to generate accurate, reliable content remains a fundamental challenge, as issues such as hallucinations and vague organization often persist. To address this, we propose FIKSurvey, a feedback-driven framework grounded in the idea that “Feedback is the key for automatic survey generation.” Specifically, FIKSurvey systematically incorporates feedback across three dimensions: outline feedback for structural clarity, citation feedback for evidence validation, and content feedback for readability and analytical depth. The framework also supports optional human-in-the-loop intervention for user-specific needs. Experiments confirm that FIKSurvey substantially improves both citation and content quality, demonstrating feedback as the critical mechanism for automatic survey generation.


Bloom-Eval: A Hierarchical Evaluation Benchmark for Automatic Survey Generation Based on Bloom’s Taxonomy
Fei Zhang | Zhe Zhao | HaiBin Wen | Tianshuo Wei | Zaixi Zhang | Chao Yang | Ye Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The rapid advance of automatic survey generation (ASG) has created a critical evaluation challenge. Existing evaluation methods suffer from both cognitive dimensional simplification and methodological unreliability, primarily due to the over-reliance on the “LLM-as-a-Judge” approach. To bridge this gap, we establish Bloom-Eval, a six-tiered benchmark based on Bloom’s Taxonomy that reliably evaluates ASG systems by prioritizing deterministic algorithms and introducing our GRADE approach for abstract abilities. Furthermore, we construct a large-scale, cross-disciplinary dataset of over 3,000 high-quality papers. Our empirical study on this benchmark reveals that while leading ASG systems are proficient format organizers, they remain unqualified knowledge integrators. This work aims to redefine ASG evaluation standards, shifting the research focus from the formal mimicry of surface structure to the cognitive deepening of intellectual content. Our method provides the ASG field with a systematic, reproducible, and theoretically grounded benchmark to guide future research.

Venues

ACL: 1
Findings: 1