Yibin Chen

2025

pdf bib abs
War of Thoughts: Competition Stimulates Stronger Reasoning in Large Language Models
Yibin Chen | Jinyi Liu | Yan Zheng | Yifu Yuan | Jianye Hao
Findings of the Association for Computational Linguistics: ACL 2025

Recent advances in Large Language Models (LLMs) have reshaped the landscape of reasoning tasks, particularly through test-time scaling (TTS) to enhance LLM reasoning. Prior research has used structures such as trees or graphs to guide LLMs in searching for optimal solutions. These methods are time-consuming and require a strong reward model (RM) to support effective solution space exploration. Tournament-style approaches eliminate the reliance on RMs through comparative evaluation but suffer from transitivity dilemmas, leading to unstable ordering. To address these issues, we propose War of Thoughts (**WoT**), a novel post-hoc method that enhances reasoning without finetuning. WoT comprises two distinct stages: (1) *Exploration*, in which diverse and meaningful candidate solutions are generated through contrastive demonstrations and multi-granularity reasoning specifications; and (2) *Competition*, where these candidate solutions are subjected to multiple rounds of matchups within a competitive arena. Throughout this iterative process, the solutions are optimized and improved, with the optimal solution being determined based on Elo ratings. Extensive experiments across various LLMs demonstrate the superiority of WoT, surpassing baselines by **10–30%**. WoT can effectively stimulate stronger reasoning abilities, achieving impressive TTS performance in both generation budget and model size. It shows higher scalability efficiency compared to the baseline within the same budget. Notably, WoT exhibits excellent scalability with model size, even outperforming a 72B model despite using a 7B model.

In inference-time scaling, Chain-of-Thought (CoT) plays a crucial role in enabling large language models (LLMs) to exhibit reasoning capabilities. However, in many scenarios, high-quality CoT data is scarce or even unavailable. In such cases, STaR-like methods can help LLMs synthesize CoT based on user queries and response, but they inevitably suffer from the risk of compounding errors. In this work, we tackle an even more challenging scenario: tool learning in the absence of user queries. We design a data scaling method using back-translation, which establishes an inference cycle to synthesize both user queries and CoT data. To reudce the compounding error of inference time, we introduce two rule-based verifiers to assess the validity of the synthesized CoT data. In particular, the Cycle Verifier facilitates performance improvement by continuously accumulating new data over multiple iterations. Our approach achieves a 75.4% pass rate and a 79.6% win rate using small models (7B) in StableToolBench. Notably, these results are obtained exclusively from self-synthesized high-quality data, without relying on external supervision or expert trajectories for warm-up.

2024

Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within the practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become redundant. Motivated by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further demonstrations. Extensive experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45×) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at https://github.com/HITsz-TMG/PICA.

Recent studies in Retrieval-Augmented Generation (RAG) have investigated extracting evidence from retrieved passages to reduce computational costs and enhance the final RAG performance, yet it remains challenging. Existing methods heavily rely on heuristic-based augmentation, encountering several issues: (1) Poor generalization due to hand-crafted context filtering; (2) Semantics deficiency due to rule-based context chunking; (3) Skewed length due to sentence-wise filter learning. To address these issues, we propose a model-based evidence extraction learning framework, SEER, optimizing a vanilla model as an evidence extractor with desired properties through self-aligned learning. Extensive experiments show that our method largely improves the final RAG performance, enhances the faithfulness, helpfulness, and conciseness of the extracted evidence, and reduces the evidence length by 9.25 times. The code will be available at https://github.com/HITsz-TMG/SEER.

As we all know, hallucinations prevail in Large Language Models (LLMs), where the generated content is coherent but factually incorrect, which inflicts a heavy blow on the widespread application of LLMs. Previous studies have shown that LLMs could confidently state non-existent facts rather than answering “I don’t know”. Therefore, it is necessary to resort to external knowledge to detect and correct the hallucinated content. Since manual detection and correction of factual errors is labor-intensive, developing an automatic end-to-end hallucination-checking approach is indeed a needful thing. To this end, we present Medico, a Multi-source evidence fusion enhanced hallucination detection and correction framework. It fuses diverse evidence from multiple sources, detects whether the generated content contains factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content. Experimental results on evidence retrieval (0.964 HR@5, 0.908 MRR@5), hallucination detection (0.927-0.951 F1), and hallucination correction (0.973-0.979 approval rate) manifest the great potential of Medico. A video demo of Medico can be found at https://youtu.be/RtsO6CSesBI.

pdf bib abs
Achieving Stronger Generation via Simple Contrastive Tuning
Zhimeng Wang | Pinzheng Wang | Juntao Li | Yibin Chen | Min Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Instruction tuning is widely used to unlock the abilities of Large Language Models (LLMs) in following human instructions, resulting in substantial performance improvements across various downstream tasks.Furthermore, contrastive decoding methods are employed to enhance instruction-tuned models. To further explore the potential of contrastive decoding, we introduce the Contrastive Tuning and Decoding (CTD) framework, which enhances model performance without requiring additional data or significant computational resources.When performing Contrastive Tuning, we optimize a correction model by targeting discrepancies between the original outputs and labels. During Contrastive Decoding, the correction model adjusts the logits of the SFT model using the same input to ensure better adherence to instructions.With the lightweight CTD framework, we refine the behavior of instruction-tuned models, improving their performance on the challenging SUPNATINST dataset with unfamiliar data distributions across various models and prompt formats.