Rongman Xu
2026
Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents
Yifei Li | Weidong Guo | Lingling Zhang | Rongman Xu | Muye Huang | Hui Liu | Lijiao Xu | Yu Xu | Jun Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue–trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at https://github.com/xjtuleeyf/Locomo-Plus.
MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Hang Yan | Fangzhi Xu | Rongman Xu | Yifei Li | Jian Zhang | Haoran Luo | Xiaobao Wu | Anh Tuan Luu | Haiteng Zhao | Qika Lin | Jun Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM TTS without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating step-wise uncertainty over time. To support flexible inference-time control, we introduce -control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide an in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 45% on average while improving accuracy by 0.33–3.46%.