Haiyuan Wan

2026

Existing methods for enhancing the inductive reasoning of large language models (LLMs) at test time typically depend on iterative self-refinement of hypotheses, which lacks explicit optimization guidance and effective error correction. This often results in superficial rewording and the accumulation of errors. To overcome these limitations, we propose MATSIR, a plug-and-play test-time framework integrating Multi-Agent coordination with Monte Carlo Tree Search to improve Inductive Reasoning. MATSIR incorporates a dual-reward mechanism that provides explicit refinement signals, promoting logically coherent and semantically enriched hypotheses rather than mere rephrasing. Furthermore, it enables trajectory-level error correction through backtracking and pruning, allowing the system to recover from erroneous intermediate hypotheses. Experiments on five benchmarks across four LLMs show that MATSIR consistently outperforms previous best methods, yielding the highest average improvement of +4.9% on QwQ-32B and all-round improvement on DeepSeek-V3. Our findings highlight that explicitly guided search with built-in error correction is essential for advancing inductive reasoning in LLMs. Code is available at https://github.com/SolarWindRider/MATSIR

2025

We show that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose **RECALL**, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design preserves domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.

Co-authors

Jing Li 1

Yue Ma 1

Venues

EMNLP 1
Findings 1
