Yi Bai
2026
EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning
Chuanrui Hu | Xingze Gao | Zuyi Zhou | Dannong Xu | Yi Bai | Xintong Li | Hui Zhang | Tong Li | Chong Zhang | Lidong Bing | Yafeng Deng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chuanrui Hu | Xingze Gao | Zuyi Zhou | Dannong Xu | Yi Bai | Xintong Li | Hui Zhang | Tong Li | Chong Zhang | Lidong Bing | Yafeng Deng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems for LLMs often store isolated records and retrieve fragments, limiting their ability to consolidate evolving experience and resolve conflicts. We introduce EverMemOS, a self-organizing memory operating system that implements an engram-inspired lifecycle for computational memory. First, Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded foresight. Second, Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Finally, Reconstructive Recollection performs MemScene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo, LongMemEval, and PersonaMem-v2 show that EverMemOS significantly outperforms state-of-the-art methods on memory-augmented reasoning tasks.
2025
SLIM: Subtrajectory-Level Elimination for More Effective Reasoning
Xifeng Yao | Chengyuan Ma | Dongyu Lang | Yinhao Ni | Zhiwei Xu | Huarui Xie | Zihao Chen | Guang Shen | Dandan Tu | Yi Bai | Changzheng Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Xifeng Yao | Chengyuan Ma | Dongyu Lang | Yinhao Ni | Zhiwei Xu | Huarui Xie | Zihao Chen | Guang Shen | Dandan Tu | Yi Bai | Changzheng Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
In recent months, substantial progress has been made in complex reasoning of Large Language Models (LLMs), particularly through the application of test-time scaling. Notable examples include, though are not limited to, OpenAI’s o1/o3/o4 series and DeepSeek-R1. When responding to a query, these models generate an extended reasoning trajectory, during which the model explores, reflects, backtracks, and self-verifies before arriving at a conclusion. However, fine-tuning models with such reasoning trajectories may not always be optimal. Our findings indicate that not all components within these reasoning trajectories contribute positively to the reasoning process; in fact, some components may affect the overall performance negatively. In this study, we divide a reasoning trajectory into individual subtrajectories and develop a “5+2” framework to: (1) systematically identify suboptimal subtrajectories within the reasoning trajectory based on five human-established criteria; (2) assess the independence of the suboptimal subtrajectories identified in (1) from the subsequent content, ensuring that their elimination does not compromise overall flow and coherence of the reasoning process. Additionally, a sampling algorithm, built upon the “5+2” framework, is employed to select data whose reasoning process is free from suboptimal subtrajectories to the highest degree. Experimental results demonstrate that our method can reduce the number of suboptimal subtrajectories by 25.9% during the inference. Furthermore, our method achieves an average accuracy of 58.92% on highly challenging AIME24, AIME25, AMC24 and MATH500 benchmarks with only two thirds of training data, surpassing the average accuracy of 58.06% achieved with the entire data, and outperforming open-source datasets, including s1K-1.1, Light-R1-SFT-stage-1, OpenR1-Math-94k, and OpenThoughts-114k, when fine-tuning Qwen2.5-Math-7B. Finally, we have validated the efficacy of our method under resource-constrained scenarios, where it exhibits performance improvements across different maximum inference token limits: 2k, 4k, 8k, and 16k tokens.