@inproceedings{zheng-etal-2025-stepsearch,
    title = "{S}tep{S}earch: Igniting {LLM}s Search Ability via Step-Wise Proximal Policy Optimization",
    author = "Zheng, Xuhui  and
      An, Kang  and
      Wang, Ziliang  and
      Wang, Yuhang  and
      Wu, Yichao",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1106/",
    pages = "21816--21841",
    ISBN = "979-8-89176-332-6",
    abstract = "Efficient multi-hop reasoning requires agents based on Large Language Models (LLMs) to iteratively acquire high-value external knowledge. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but such models underperform on complex multi-hop QA because \textit{rewards from the global signal alone are sparse}. To address this gap, we introduce \textbf{StepSearch}, a framework for search LLMs trained with a \textit{step-wise} proximal policy optimization method. It provides richer and more detailed intermediate search rewards and token-level process supervision, based on information gain and redundancy penalties, to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories, built from open-source datasets through a dedicated data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving \textbf{11.2{\%}} and \textbf{4.2{\%}} absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training samples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. The project is open source at https://github.com/Zillwang/StepSearch"
}