Sheng Guo
2026
Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning
Qiao Liang | Yuke Zhu | Chao Ge | Lei Yang | Ying Shen | Bo Zheng | Sheng Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Qiao Liang | Yuke Zhu | Chao Ge | Lei Yang | Ying Shen | Bo Zheng | Sheng Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code is publicly released for reproducibility at https://anonymous.4open.science/r/ELPO-7C19.
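The abstract does not spell out the search procedure, so the following is only a minimal sketch of the generic idea of binary-searching for a first failing step. It assumes that recoverability is monotone along a trajectory (once a prefix is irrecoverable, every longer prefix is too) and that a prefix can be judged recoverable if any sampled continuation from it succeeds. The names `make_rollout_oracle`, `first_irrecoverable_step`, `sample_success`, and `rollouts_per_probe` are hypothetical. This is not the ELPO implementation, and it omits rollout-tree construction, hierarchical advantage attribution, and adaptive clipping.

```python
from typing import Callable, Optional


def make_rollout_oracle(
    sample_success: Callable[[int], bool], rollouts_per_probe: int
) -> Callable[[int], bool]:
    """Hypothetical oracle: the prefix ending at step k counts as recoverable
    if at least one of `rollouts_per_probe` sampled continuations succeeds."""

    def is_recoverable(k: int) -> bool:
        return any(sample_success(k) for _ in range(rollouts_per_probe))

    return is_recoverable


def first_irrecoverable_step(
    num_steps: int, is_recoverable: Callable[[int], bool]
) -> Optional[int]:
    """Smallest k in [0, num_steps) whose prefix is irrecoverable, or None.

    Assumes monotonicity (irrecoverable prefixes stay irrecoverable when
    extended). Uses at most ceil(log2(num_steps + 1)) oracle probes, so the
    total rollout cost is bounded by probes * rollouts_per_probe.
    """
    # Invariant: the first irrecoverable index (num_steps if none) is in [lo, hi].
    lo, hi = 0, num_steps
    while lo < hi:
        mid = (lo + hi) // 2
        if is_recoverable(mid):
            lo = mid + 1  # the first irrecoverable step comes later
        else:
            hi = mid  # mid is a candidate; look for an earlier one
    return lo if lo < num_steps else None


if __name__ == "__main__":
    # Toy 12-step trajectory whose first irrecoverable step is index 5.
    oracle = make_rollout_oracle(lambda k: k < 5, rollouts_per_probe=4)
    print(first_irrecoverable_step(12, oracle))  # prints 5
```

In the toy run the search probes steps 6, 3, 5, and 4, i.e. four probes instead of scanning all twelve prefixes.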
Towards Interpretable Tabular Reasoning: Enhancing LLM Reasoning on Tabular Data with Pre-Constructed Logic Graph
Lirong Gao | Zewei Yu | Zhongrui Yin | Qi Zhang | Yuke Zhu | Bo Zheng | Haobo Wang | Junbo Zhao | Gang Chen | Sheng Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Lirong Gao | Zewei Yu | Zhongrui Yin | Qi Zhang | Yuke Zhu | Bo Zheng | Haobo Wang | Junbo Zhao | Gang Chen | Sheng Guo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tabular data is widely used in fields such as finance and healthcare. Traditional tree-based models are prevalent for tabular prediction tasks due to their ability to handle heterogeneous features. However, their heavy reliance on feature engineering limits both their generalizability and their human-readable interpretability. Large Language Models (LLMs), on the other hand, naturally provide intermediate reasoning steps, thus offering greater transparency in decision-making. Nevertheless, LLMs often fail to match the predictive performance of tree-based models on tabular data. To address these challenges, we propose a novel Logic-Graph-Enhanced LLM Reasoning (LogGER) framework that integrates the strengths of tree-based models and LLMs. Specifically, we reformulate the traditional decision tree as a human-readable logic graph, which explicitly models the causal relationships between features and targets. This logic graph is automatically constructed by LLMs based on data priors and serves as the foundation for LogGER. To fully leverage the logic graph, we further introduce a logic-graph-guided process supervision approach, which evaluates and enhances the quality of the LLM’s intermediate reasoning steps using logic-graph-aided process rewards. Extensive experiments demonstrate that LogGER consistently outperforms both tree-based models and state-of-the-art LLM methods on a variety of tabular prediction tasks, achieving superior accuracy and interpretability.
2025
LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization
Qi Zhang | Shouqing Yang | Lirong Gao | Hao Chen | Xiaomeng Hu | Jinglei Chen | Jiexiang Wang | Sheng Guo | Bo Zheng | Haobo Wang | Junbo Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Qi Zhang | Shouqing Yang | Lirong Gao | Hao Chen | Xiaomeng Hu | Jinglei Chen | Jiexiang Wang | Sheng Guo | Bo Zheng | Haobo Wang | Junbo Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have demonstrated impressive reasoning capabilities with the emergence of reasoning models such as OpenAI-o1 and DeepSeek-R1. Recent research focuses on integrating these reasoning capabilities into retrieval-augmented generation (RAG) via outcome-supervised reinforcement learning (RL), but the correctness of intermediate think-and-search steps is usually neglected. To address this issue, we design a process-level reward module that compensates for outcome-level supervision's blindness to intermediate reasoning steps, without requiring additional annotation. Building on this module, we propose **Le**arning to **T**hink-and-**S**earch (**LeTS**), a novel framework that integrates stepwise process rewards and outcome-based rewards into current RL methods for RAG. Extensive experiments demonstrate the generalization and inference efficiency of **LeTS** across various RAG benchmarks. These results also reveal the potential of hybridizing process- and outcome-level rewards to boost LLMs’ reasoning ability via RL in other scenarios.
2022
RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning
Yaqing Wang | Xin Tian | Haoyi Xiong | Yueyang Li | Zeyu Chen | Sheng Guo | Dejing Dou
Findings of the Association for Computational Linguistics: NAACL 2022
Yaqing Wang | Xin Tian | Haoyi Xiong | Yueyang Li | Zeyu Chen | Sheng Guo | Dejing Dou
Findings of the Association for Computational Linguistics: NAACL 2022
Pre-trained language models (PLMs) can provide a good starting point for downstream applications. However, it is difficult to generalize PLMs to new tasks given only a few labeled samples. In this work, we show that Relation Graph augmented Learning (RGL) can improve the performance of few-shot natural language understanding tasks. During learning, RGL constructs a relation graph based on the label consistency between samples in the same batch, and learns to solve the resulting node classification and link prediction problems on the relation graph. In this way, RGL makes full use of the limited supervision, which can improve tuning effectiveness. Extensive experimental results show that RGL consistently improves the performance of prompt-based tuning strategies.
2020
Exploring Contextual Word-level Style Relevance for Unsupervised Style Transfer
Chulun Zhou | Liangyu Chen | Jiachen Liu | Xinyan Xiao | Jinsong Su | Sheng Guo | Hua Wu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Chulun Zhou | Liangyu Chen | Jiachen Liu | Xinyan Xiao | Jinsong Su | Sheng Guo | Hua Wu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Unsupervised style transfer aims to change the style of an input sentence while preserving its original content without using parallel training data. Current dominant approaches lack fine-grained control over the influence of the target style and are therefore unable to yield desirable output sentences. In this paper, we propose a novel attentional sequence-to-sequence (Seq2seq) model that dynamically exploits the relevance of each output word to the target style for unsupervised style transfer. Specifically, we first pretrain a style classifier, where the relevance of each input word to the original style can be quantified via layer-wise relevance propagation. In a denoising auto-encoding manner, we train an attentional Seq2seq model to simultaneously reconstruct input sentences and re-predict the previously quantified word-level style relevance. In this way, the model learns to automatically predict the style relevance of each output word. Then, we equip the decoder of this model with a neural style component to exploit the predicted word-level style relevance for better style transfer. Finally, we fine-tune this model using a carefully designed objective function involving style transfer, style relevance consistency, content preservation, and fluency modeling loss terms. Experimental results show that our proposed model achieves state-of-the-art performance in terms of both transfer accuracy and content preservation.
2010
Search
Fix author
Co-authors
- Lirong Gao 2
- Haobo Wang 2
- Junbo Zhao 2
- Bo Zheng 2
- Yuke Zhu 2
- Gang Chen 1
- Hao Chen 1
- Jinglei Chen 1
- Liang-Yu Chen 1
- Zeyu Chen 1
- Dejing Dou 1
- Chao Ge 1
- Xiaomeng Hu 1
- Yueyang Li 1
- Qiao Liang 1
- Jiachen Liu 1
- Naren Ramakrishnan 1
- Ying Shen 1
- Jinsong Su 1
- Xin Tian 1
- Jiexiang Wang 1
- Yaqing Wang 1
- Hua Wu (吴华) 1
- Xinyan Xiao 1
- Haoyi Xiong 1
- Lei Yang 1
- Shouqing Yang 1
- Zhongrui Yin 1
- Zewei Yu 1
- Qi Zhang 1
- Qi Zhang 1
- Bo Zheng 1
- Chulun Zhou 1