Shuo Shang
2026
PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
Ruixiang Feng | Yuntao Wen | Silin Zhou | Ke Shi | Yifan Wang | Ran Le | Zhenwei An | Zongchao Chen | Chen Yang | Guangyue Peng | Yiming Jia | Dongsheng Wang | Tao Zhang | Lisi Chen | Yang Song | Shen Gao | Shuo Shang
Findings of the Association for Computational Linguistics: ACL 2026
Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To address these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, a difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, and generalizes to code, science, and general domains.
DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
JunShuo Zhang | Chengrui Huang | Feng Guo | Zihan Li | Ke Shi | Menghua Jiang | Jiguo Yu | Shuo Shang | Shen Gao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language model (LLM) agents that follow the sequential “reason-then-act” paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose Diverse Parallel Exploration Policy Optimization (DPEPO), a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO has two stages: an initial supervised fine-tuning (SFT) stage imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines.
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
Siqi Fan | Xiusheng Huang | Yiqun Yao | Xuezhi Fang | Kang Liu | Peng Han | Shuo Shang | Aixin Sun | Yequan Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LifeState-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models’ self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.
2025
DNASpeech: A Contextualized and Situated Text-to-Speech Dataset with Dialogues, Narratives and Actions
Chuanqi Cheng | Hongda Sun | Bo Du | Shuo Shang | Xinrong Hu | Rui Yan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In this paper, we propose contextualized and situated text-to-speech (CS-TTS), a novel TTS task to promote more accurate and customized speech generation using prompts with Dialogues, Narratives, and Actions (DNA). While prompt-based TTS methods facilitate controllable speech generation, existing TTS datasets lack situated descriptive prompts aligned with speech data. To address this data scarcity, we develop an automatic annotation pipeline enabling multifaceted alignment among speech clips, content text, and their respective descriptions. Based on this pipeline, we present DNASpeech, a novel CS-TTS dataset with high-quality speeches with DNA prompt annotations. DNASpeech contains 2,395 distinct characters, 4,452 scenes, and 22,975 dialogue utterances, along with over 18 hours of high-quality speech recordings. To accommodate more specific task scenarios, we establish a leaderboard featuring two new subtasks for evaluation: CS-TTS with narratives and CS-TTS with dialogues. We also design an intuitive baseline model for comparison with existing state-of-the-art TTS methods on our leaderboard. Comprehensive experimental results demonstrate the quality and effectiveness of DNASpeech, validating its potential to drive advancements in the TTS field.
CESRec: Constructing Pseudo Interactions for Sequential Recommendation via Conversational Feedback
Yifan Wang | Shen Gao | Jiabao Fang | Rui Yan | Billy Chiu | Shuo Shang
Findings of the Association for Computational Linguistics: EMNLP 2025
Sequential Recommendation Systems (SRS) have become essential in many real-world applications. However, existing SRS methods often rely on collaborative filtering signals and fail to capture real-time user preferences, while Conversational Recommendation Systems (CRS) excel at eliciting immediate interests through natural language interactions but neglect historical behavior. To bridge this gap, we propose CESRec, a novel framework that integrates the long-term preference modeling of SRS with the real-time preference elicitation of CRS. We introduce semantic-based pseudo interaction construction, which dynamically updates users’ historical interaction sequences by analyzing conversational feedback, generating a pseudo-interaction sequence that seamlessly combines long-term and real-time preferences. Additionally, we reduce the impact of outliers in historical items that deviate from users’ core preferences by proposing dual alignment outlier items masking, which identifies and masks such items using semantic-collaborative aligned representations. Extensive experiments demonstrate that CESRec achieves state-of-the-art performance by boosting strong SRS models, validating its effectiveness in integrating conversational feedback into SRS.
TTPA: Token-level Tool-use Preference Alignment Training Framework with Fine-grained Evaluation
Chengrui Huang | Shen Gao | Zhengliang Shi | Dongsheng Wang | Shuo Shang
Findings of the Association for Computational Linguistics: EMNLP 2025
Existing tool-learning methods usually rely on supervised fine-tuning and often overlook fine-grained optimization of internal tool call details, leading to limitations in preference alignment and error discrimination. To overcome these challenges, we propose the Token-level Tool-use Preference Alignment Training Framework (TTPA), a training paradigm for constructing token-level tool-use preference datasets that align LLMs with fine-grained preferences using a novel error-oriented scoring mechanism. TTPA first introduces reversed dataset construction, a method for creating high-quality, multi-turn tool-use datasets by reversing the generation flow. Additionally, we propose Preference Oriented Tool-use Dataset Construction to capture fine-grained preferences by modeling token-level differences during generation. To address biases in scoring, we introduce the Error-oriented Scoring Mechanism, which quantifies tool-call errors and can be used as a training signal. Extensive experiments on three diverse benchmark datasets demonstrate that TTPA significantly improves tool-using performance while showing strong generalization ability across models and datasets.
CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis
Ruixiang Feng | Shen Gao | Xiuying Chen | Lisi Chen | Shuo Shang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit a specific cultural bias, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse cultural-related questions, constructs critique data in multiple culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalOpinionQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalOpinionQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.
Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement
Xiaoqing Zhang | Yuhan Liu | Flood Sung | Xiuying Chen | Shuo Shang | Rui Yan
Findings of the Association for Computational Linguistics: ACL 2025
Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce ThinkCoder, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide LLM’s evolution. This approach enhances LLM’s exploration efficiency via preference learning, cutting costs while maintaining accuracy. ThinkCoder boosts the performance with a single LLM, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 3.0% over MapCoder with just 6.4% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.5% higher Pass@1 after 2 rounds, outperforming AgentCoder’s 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20% of the computational resources. These results highlight the framework’s effectiveness and scalability.
Position-Aware Depth Decay Decoding (D3): Boosting Large Language Model Inference Efficiency
Siqi Fan | Xuezhi Fang | Xingrun Xing | Peng Han | Shuo Shang | Yequan Wang
Findings of the Association for Computational Linguistics: ACL 2025
Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to save 1.5x operations efficiently while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding (D3), which leverages a power-law decay function, ⌊L × α^i⌋, to determine the number of layers to retain when generating token T_i. Remarkably, without any retraining, D3 achieves success across a wide range of generation tasks for the first time. Experiments on large language models (Llama) with 7 to 70 billion parameters show that D3 can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance, with nearly no performance drop (<1%) on the GSM8K and BBH benchmarks.
Lock on Target! Precision Unlearning via Directional Control
Yuntao Wen | Ruixiang Feng | Feng Guo | Yifan Wang | Ran Le | Yang Song | Shen Gao | Shuo Shang
Findings of the Association for Computational Linguistics: EMNLP 2025
Unlearning methods aim to effectively remove harmful, sensitive, or outdated knowledge without costly retraining of the model. However, existing methods suffer from two critical limitations: (1) collateral forgetting, where erasing target data inadvertently removes related but desirable knowledge, and (2) generality forgetting, where aggressive unlearning degrades the model’s general capabilities. To address these challenges, we propose DirectiOn Guide unlEarning (DOGE), a novel method that enables precise knowledge erasure by identifying and leveraging a targeted “unlearning direction” in the model’s parameter space. DOGE first extracts this direction through differential analysis of representations for forgotten and retained samples, pinpointing the exact subspace associated with unwanted knowledge. It then selectively applies updates along this direction, ensuring minimal interference with retained information and general model performance. Experiments across multiple benchmarks demonstrate that DOGE achieves state-of-the-art unlearning precision while preserving both related knowledge and general capabilities.
More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
Xiaoqing Zhang | Ang Lv | Yuhan Liu | Flood Sung | Wei Liu | Jian Luan | Shuo Shang | Xiuying Chen | Rui Yan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated and Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset hoping to facilitate further research in many-shot ICL.
2024
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
Qinzhuo Wu | Weikai Xu | Wei Liu | Tao Tan | Liujian Liujianfeng | Ang Li | Jian Luan | Bin Wang | Shuo Shang
Findings of the Association for Computational Linguistics: EMNLP 2024
Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize a VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions, and lack inter-UI understanding. To address these issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset, Mobile3M, from scratch, which contains 3 million UI pages and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
Quan Tu | Shilong Fan | Zihang Tian | Tianhao Shen | Shuo Shang | Xin Gao | Rui Yan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 11,376 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. To facilitate convenient evaluation of these subjective metrics in CharacterEval, we further developed CharacterRM, a role-playing reward model based on human annotations, which has a higher correlation with human judgment than GPT-4. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation.
360°REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System
Shen Gao | Hao Li | Zhengliang Shi | Chengrui Huang | Quan Tu | Shuo Shang | Zhiliang Tian | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2024
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
Shihan Deng | Weikai Xu | Hongda Sun | Wei Liu | Tao Tan | Jianfeng Liu | Ang Li | Jian Luan | Bin Wang | Rui Yan | Shuo Shang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps. Dataset and platform will be released in the future.
“In-Dialogues We Learn”: Towards Personalized Dialogue Without Pre-defined Profiles through In-Dialogue Learning
Chuanqi Cheng | Quan Tu | Wei Wu | Shuo Shang | Cunli Mao | Zhengtao Yu | Rui Yan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Personalized dialogue systems have gained significant attention in recent years for their ability to generate responses in alignment with different personas. However, most existing approaches rely on pre-defined personal profiles, which are not only time-consuming and labor-intensive to create but also lack flexibility. We propose In-Dialogue Learning (IDL), a fine-tuning framework that enhances the ability of pre-trained large language models to leverage dialogue history to characterize persona for personalized dialogue generation tasks without pre-defined profiles. Our experiments on three datasets demonstrate that IDL brings substantial improvements, with BLEU and ROUGE scores increasing by up to 200% and 247%, respectively. Additionally, the results of human evaluations further validate the efficacy of our proposed method.
DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy
Hongda Sun | Weikai Xu | Wei Liu | Jian Luan | Bin Wang | Shuo Shang | Ji-Rong Wen | Rui Yan
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in large language models (LLMs) have revolutionized the landscape of reasoning tasks. To enhance the capabilities of LLMs to emulate human reasoning, prior studies have focused on modeling reasoning steps using various thought structures like chains, trees, or graphs. However, LLM-based reasoning still encounters the following challenges: (1) Limited adaptability of preset structures to diverse tasks; (2) Insufficient precision in exploiting known conditions to derive new ones; and (3) Inadequate consideration of historical reasoning experiences for subsequent reasoning steps. To this end, we propose DetermLR, a novel perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy. First, we categorize known conditions into two types: determinate and indeterminate premises, facilitating the transformation process. Subsequently, we leverage quantitative measurements to prioritize more relevant premises to explore new insights. Furthermore, we automate the storage and extraction of available premises and reasoning paths with reasoning memory, preserving historical reasoning details for subsequent reasoning steps. Comprehensive experimental results demonstrate that DetermLR surpasses all baselines on various logical reasoning benchmarks: LogiQA, ProofWriter, FOLIO, PrOntoQA, and LogicalDeduction. Compared to previous multi-step reasoning methods, DetermLR achieves higher accuracy with fewer reasoning steps, highlighting its superior efficiency and effectiveness in solving logical reasoning tasks.
Co-authors
- Shen Gao 7
- Rui Yan 5
- Jian Luan 4
- Xiuying Chen 3
- Ruixiang Feng 3
- Chengrui Huang 3
- Wei Liu 3
- Hongda Sun 3
- Quan Tu 3
- Yifan Wang 3
- Bin Wang 3
- Weikai Xu 3
- Rui Yan 3
- Lisi Chen 2
- Chuanqi Cheng 2
- Siqi Fan 2
- Xuezhi Fang 2
- Feng Guo 2
- Peng Han 2
- Ran Le 2
- Ang Li 2
- Yuhan Liu 2
- Ke Shi 2
- Zhengliang Shi 2
- Flood Sung 2
- Tao Tan 2
- Dongsheng Wang 2
- Yequan Wang 2
- Yuntao Wen 2
- Xiaoqing Zhang 2
- Zhenwei An 1
- Zongchao Chen 1
- Billy Chiu 1
- Shihan Deng 1
- Bo Du 1
- Shilong Fan 1
- Jiabao Fang 1
- Xin Gao 1
- Xinrong Hu 1
- Minlie Huang 1
- Xiusheng Huang 1
- Yiming Jia 1
- Menghua Jiang 1
- Zihan Li 1
- Hao Li 1
- Jianfeng Liu 1
- Wei Liu 1
- Kang Liu 1
- Liujian Liujianfeng 1
- Ang Lv 1
- Cunli Mao 1
- Guangyue Peng 1
- Tianhao Shen 1
- Yang Song 1
- Yang Song 1
- Aixin Sun 1
- Zihang Tian 1
- Zhiliang Tian 1
- Ji-Rong Wen 1
- Qinzhuo Wu 1
- Wei Wu 1
- Xingrun Xing 1
- Chen Yang 1
- Yiqun Yao 1
- Jiguo Yu 1
- Zhengtao Yu (余正涛) 1
- Tao Zhang 1
- JunShuo Zhang 1
- Silin Zhou 1