Hengwei Liu
2026
AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs
Qingqing Lyu | Linjuan Wu | Yongliang Shen | Hengwei Liu | Hao Li | Shengpei Jiang | Yin Zhang | Weiming Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the rapid progress of LLMs, their evaluation remains hindered by static, manually curated benchmarks with limited task coverage and poor adaptability to emerging domains. Existing automated approaches typically operate within fixed task schemas and often fail to autonomously discover new evaluation dimensions, limiting both scalability and effectiveness. To address these gaps, we propose AutoTaskEval, an automated framework that constructs domain-specific benchmarks directly from unstructured corpora. Using a refined Bloom’s Taxonomy, the framework systematically discovers tasks, enriches contextual grounding via iterative Socratic prompting, and generates diverse, progressively challenging evaluation instances. Applied to the complex and knowledge-intensive legal domain, AutoTaskEval uncovers a broader and more fine-grained task space than expert-curated benchmarks while producing high-quality instances that preserve established model-level evaluation trends. We further validate its robustness in a low-structure e-commerce review domain. Together, these results show that AutoTaskEval enables scalable, adaptive, and high-fidelity LLM assessment across domains and model families, advancing autonomous and capability-sensitive evaluation.
Leveraging Outline-Optimized Generative Interactions and Critique for Self-Refining Outlines with Reinforcement Learning
Hengwei Liu | Haoyuan Ma | Qingqing Lyu | Daoxin Zhang | Yao Hu | Yongliang Shen | Yin Zhang | Weiming Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long-form outline generation requires satisfying multiple competing objectives simultaneously: outlines must be engaging, well-organized, topically relevant, and comprehensive while maintaining logical consistency across hierarchical structures. Current approaches either rely on expensive multi-turn interactions with large language models or employ procedural refinement pipelines that cannot systematically learn from critique. We present Logic-RL, a framework that transforms critique-guided outline refinement into a learnable policy through reinforcement learning. Our approach constructs refinement trajectories from teacher demonstrations, synthesizes explicit reasoning chains that decompose the critique-revision process, and optimizes a refinement policy using group relative policy optimization with structure-aware rewards. Experiments on FreshWiki and WikiOutline demonstrate that Logic-RL achieves substantial improvements over strong baselines, with the 0.6B model obtaining a 79.17% relative gain and the 1.7B model achieving an 8.67% improvement in average rubric scores compared to the best existing methods. Further analysis reveals that learned refinement policies generalize across domains and can be iteratively applied, with quality continuing to improve through three refinement rounds before diminishing returns.
2025
Logic: Long-form Outline Generation via Imitative and Critical Self-refinement
Hengwei Liu | Yongliang Shen | Zhe Zheng | Haoyuan Ma | Xingyu Wu | Yin Zhang | Weiming Lu
Findings of the Association for Computational Linguistics: EMNLP 2025
Long-form outline generation for expository articles requires both comprehensive knowledge coverage and logical coherence, which is essential for creating detailed Wikipedia-like content. However, existing methods face critical limitations: outlines generated in the pre-writing stage often have low knowledge density and lack detail, while retrieval-augmented approaches struggle to maintain logical coherence across retrieved information. Additionally, unlike human writers who can iteratively improve through peer feedback and reference similar topics, current approaches lack effective mechanisms for systematic outline refinement. To address these challenges, we propose Logic, a Long-form Outline Generation system via Imitative and Critical self-refinement that mimics human writers’ refinement process. Logic establishes a coherent planning framework and structured knowledge base, learns from similar topic outlines through imitation, and continuously improves through model-based critique. Experiments on FreshWiki and our dataset WikiOutline show that, compared to the best baseline, Logic’s long-form outlines are more organized (with increases of 22.85% and 21.65% respectively) and more logically coherent (with increases of 16.19% and 12.24% respectively). Human evaluation further validates Logic’s effectiveness in generating comprehensive and well-structured long-form outlines.
DB-Explore: Automated Database Exploration and Instruction Synthesis for Text-to-SQL
Haoyuan Ma | Yongliang Shen | Hengwei Liu | Wenqi Zhang | Haolei Xu | Qiuying Peng | Jun Wang | Weiming Lu
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent text-to-SQL systems powered by large language models (LLMs) have demonstrated remarkable performance in translating natural language queries into SQL. However, these systems often struggle with complex database structures and domain-specific queries, as they primarily focus on enhancing logical reasoning and SQL syntax while overlooking the critical need for comprehensive database understanding. To address this limitation, we propose DB-Explore, a novel framework that systematically aligns LLMs with database knowledge through automated exploration and instruction synthesis. DB-Explore constructs database graphs to capture complex relational schemas, leverages GPT-4 to systematically mine structural patterns and semantic knowledge, and synthesizes instructions to distill this knowledge for efficient fine-tuning of LLMs. Our framework enables comprehensive database understanding through diverse sampling strategies and automated instruction generation, bridging the gap between database structures and language models. Experiments conducted on the SPIDER and BIRD benchmarks validate the effectiveness of DB-Explore, achieving an execution accuracy of 67.0% on BIRD and 87.8% on SPIDER. Notably, our open-source implementation based on Qwen2.5-Coder-7B achieves state-of-the-art results at minimal computational cost, outperforming several GPT-4-driven Text-to-SQL systems.