Haiyu Zhao

2026

In knowledge-intensive creative tasks, Large Language Models (LLMs) often generate outputs that extend beyond established knowledge, making direct verification against current evidence impractical. Unlike factual hallucinations checked against ground truth, such outputs arise naturally in creative generation, where extending beyond current knowledge is often the goal. Yet prior work debates whether hallucination should be suppressed or embraced without empirically analyzing this unverifiable subclass. On the ideation evaluation side, existing work focuses on individual outputs without characterizing the unverifiable space as a whole. To address this gap, we propose a novelty-verifiability characterization that distinguishes Creative Synthesis (Region A) from Groundless Fabrication (Region B), and study it through a conceptual creation task where LLMs synthesize novel scientific concepts. Through 32,400 generations across three technical domains and 1,080 human judgments, we find that Region A is non-negligible (4.7%) and robust, persisting across generation strategies, models, domains, and embedding choices. A retrospective recovery experiment further shows that LLMs can approximate post-cutoff scientific concepts in controlled combinatorial settings. Our findings suggest that the unverifiable space is not uniformly noise but exhibits empirically distinguishable internal structure, providing an empirical basis for more selective hallucination governance. Code: https://github.com/YuLab1/llm-concept-creation

2025


StructuThink: Reasoning with Task Transition Knowledge for Autonomous LLM-Based Agents
Haiyu Zhao | Zhenyu Guo | Chunhong Zhang | Ziyu Zhou | Zheng Hu
Findings of the Association for Computational Linguistics: EMNLP 2025

Decision-making tasks have highlighted fundamental challenges in grounding decisions within real-world contexts. Traditional decision knowledge utilization methods often struggle to effectively integrate structured decision constraints, limiting their ability to decompose high-level tasks, maintain logical consistency, and adapt to dynamic environments. To bridge this gap, we introduce StructuThink, a knowledge-structured reasoning framework that enhances LLM-based agents with explicit decision constraints. Specifically, we propose the Task Transition Knowledge Graph (TTKG), which learns decision knowledge in embodied scenarios. Leveraging this knowledge, we propose the StructuThink framework, comprising a subtask chain constructor for grounding natural language instructions and a constraint-based executor for adaptive and consistent decision-making. We validate StructuThink across multiple benchmarks, including ALFWorld and WebShop, where it achieves higher task success rates (improving by up to 7%) and more efficient action sequences (requiring up to 15% fewer steps) than baseline methods. Our approach enables LLMs to more effectively ground decision-making in domain-specific scenarios, enhancing both interpretability and reliability, thus paving the way for more reliable and adaptable decision-making systems.


All That Glitters is Not Gold: Improving Robust Retrieval-Augmented Language Models with Fact-Centric Preference Alignment
Jia Hao | Chunhong Zhang | Jiarun Liu | Haiyu Zhao | Zhiqiang Zhan | Zheng Hu
Findings of the Association for Computational Linguistics: ACL 2025

Retrieval-augmented language models (RALMs) rely on retrieved external knowledge to generate responses, which makes them vulnerable to retrieval results that contain noisy documents. Previous works integrate additional filters or finetune Large Language Models (LLMs) to learn adaptive retrieval in order to reduce the performance damage caused by noisy documents. However, prior noise filtering may lead to the loss of crucial information, and these methods do not focus on distracting documents with high semantic relevance, which are the most challenging case. In this study, we propose a training method for fact-centric preference alignment (FPA) to improve the ability of LLMs to directly extract useful information from noisy retrieval results without prior filtering. Our method performs positive document mining based on factual consistency and uses LLM self-generated synthetic data as training data without manual annotation. We evaluate our FPA on four question answering benchmarks, and the experimental results demonstrate that our method achieves significant improvement with a small scale of training data.

Co-authors

Jia Hao 1

Yu Yan 1

Venues

Findings 2
ACL 1
