Zijie Zhao
2026
Streaming Hallucination Detection in Long Chain-of-Thought Reasoning
Haolang Lu | Minghui Pan | Ripeng LI | Guoshun Nan | Jialin Zhuang | Zijie Zhao | Zhongxiang Sun | Kun Wang | Yang Liu
Findings of the Association for Computational Linguistics: ACL 2026
Long chain-of-thought (CoT) reasoning improves the performance of large language models, yet hallucinations in such settings often emerge subtly and propagate across reasoning steps. We suggest that hallucination in long CoT reasoning is better understood as an evolving latent state rather than a one-off erroneous event. Accordingly, we treat step-level hallucination judgments as local observations and introduce a cumulative prefix-level hallucination signal that tracks the global evolution of the reasoning state over the entire trajectory. Overall, our approach enables streaming hallucination detection in long CoT reasoning, providing real-time, interpretable evidence.
CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic
Yaocheng Zhang | Haohuan Huang | Zijun Song | Zijie Zhao | Qichao Zhang | Yuanheng Zhu | Dongbin Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Tool-Integrated Reasoning (TIR) with search engines enables large language models to iteratively retrieve up-to-date external knowledge, enhancing adaptability and generalization in complex question-answering tasks. However, existing search agent pipelines typically depend on reinforcement learning based optimization, which often suffers from sparse outcome rewards, leading to inefficient exploration and unstable training. We introduce CriticSearch, a fine-grained credit-assignment framework that supplies dense, turn-level feedback via a retrospective critic mechanism. During training, a frozen, asymmetric critique LLM retrospectively evaluates each turn using privileged information from the full trajectory and gold answers, converting these assessments into stable, dense rewards that guide policy improvement. Experimental results across diverse multi-hop reasoning benchmarks demonstrate that CriticSearch consistently outperforms existing baselines, achieving faster convergence, improved training stability, and higher performance.