Xuying Ning
2026
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
Yuanchen Bei | Tianxin Wei | Xuying Ning | Yanjun Zhao | Zhining Liu | Xiao Lin | Yada Zhu | Hendrik Hamann | Jingrui He | Hanghang Tong
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across twelve memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models. Our benchmark and dataset are available at https://github.com/YuanchenBei/Mem-Gallery.
AdaFuse: Adaptive Ensemble Decoding for Large Language Models
Chengming Cui | Tianxin Wei | Ziyi Chen | Ruizhong Qiu | Zhichen Zeng | Zhining Liu | Xuying Ning | Duo Zhou | Jingrui He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain QA, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%.
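The uncertainty gate described in the abstract above can be illustrated with a minimal sketch. The abstract does not state which uncertainty measure AdaFuse uses, so Shannon entropy of the next-token distribution, the function names `entropy` and `should_ensemble`, and the threshold value are all assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_ensemble(probs, threshold=0.5):
    """Hypothetical gate: request ensembling only when the model is uncertain.

    `probs` is one model's next-token distribution; `threshold` is an
    illustrative value, not one taken from the paper.
    """
    return entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution: keep decoding alone
uncertain = [0.25, 0.25, 0.25, 0.25]   # flat distribution: consult the ensemble

print(should_ensemble(confident))
print(should_ensemble(uncertain))
```

A gate of this kind only spends the cost of consulting other models at steps where a single model's prediction is unreliable, which is the trade-off the abstract points to.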
Harnessing Consistency for Robust Test-Time LLM Ensemble
Zhichen Zeng | Qi Yu | Xiao Lin | Ruizhong Qiu | Xuying Ning | Tianxin Wei | Yuchen Yan | Jingrui He | Hanghang Tong
Findings of the Association for Computational Linguistics: EACL 2026
Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. *Token-level consistency* captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. *Model-level consistency* models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness. Our code is available at https://github.com/zhichenz98/CoRE-EACL26.
PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs
Yanjun Zhao | Tianxin Wei | Jiaru Zou | Xuying Ning | Yuanchen Bei | Lingjie Chen | Simmi Rana | Wendy H. Yang | Hanghang Tong | Jingrui He
Findings of the Association for Computational Linguistics: ACL 2026
Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PaperMind, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PaperMind is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PaperMind enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https://github.com/Yanjun-Zhao/PaperMind.
2025
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Junyu Zhang | Runpei Dong | Han Wang | Xuying Ning | Haoran Geng | Peihao Li | Xialin He | Yutong Bai | Jitendra Malik | Saurabh Gupta | Huan Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper presents AlphaOne (𝛼1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. 𝛼1 first introduces the 𝛼 moment, which represents the scaled thinking phase with a universal parameter 𝛼. Within this scaled pre-𝛼 moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the 𝛼 moment, 𝛼1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate 𝛼1’s superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/.
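The two-phase schedule described in the abstract above (Bernoulli-sampled transition tokens before the 𝛼 moment, deterministic termination after it) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `schedule_actions`, the action labels, the step-indexed probability function `p_insert`, and the linear decay used in the example are all hypothetical.

```python
import random


def schedule_actions(num_steps, alpha_moment, p_insert, seed=0):
    """Label each decoding step with a hypothetical scheduling action.

    Before `alpha_moment`, a Bernoulli draw with probability `p_insert(step)`
    decides whether to insert a slow-thinking transition token. At
    `alpha_moment`, the end-of-thinking token is emitted deterministically,
    and every later step belongs to answer generation.
    """
    rng = random.Random(seed)
    actions = []
    for step in range(num_steps):
        if step < alpha_moment:
            draw = rng.random() < p_insert(step)
            actions.append("insert_transition" if draw else "continue")
        elif step == alpha_moment:
            actions.append("end_thinking")
        else:
            actions.append("answer")
    return actions


# Assumed schedule: insertion probability decays linearly to zero at the alpha moment.
alpha_moment = 6
actions = schedule_actions(10, alpha_moment, lambda t: 1.0 - t / alpha_moment)
print(actions)
```

Passing the insertion probability as a function of the step leaves the schedule open, since the abstract says only that transitions are scheduled dynamically.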
iAgent: LLM Agent as a Shield between User and Recommender Systems
Wujiang Xu | Yunxiao Shi | Zujie Liang | Xuying Ning | Kai Mei | Kun Wang | Xi Zhu | Min Xu | Yongfeng Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Traditional recommender systems usually follow the user-platform paradigm, where users are directly exposed to, and controlled by, the platform’s recommendation algorithms. However, defects in recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are designed with commercial objectives in mind, focusing on the platform’s benefits, which may hinder their ability to protect and capture users’ true interests. Second, these models are typically optimized using data from all users, which may overlook individual users’ preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where an agent serves as a protective shield between the user and the recommender system, enabling indirect exposure. To this end, we first construct four recommendation datasets, denoted as InstructRec, along with user instructions for each record. To understand users’ intentions, we design an Instruction-aware Agent capable of using tools to acquire knowledge from external environments. Moreover, we introduce an Individual Instruction-aware Agent, which incorporates a dynamic memory mechanism to optimize from individual feedback. Results on four datasets demonstrate that iAgent consistently achieves an average improvement of 16.6% over SOTA baselines across ranking metrics. Moreover, iAgent mitigates echo chamber effects and effectively alleviates model bias against disadvantaged (less active) users, serving as a shield between user and recommender systems.
Co-authors
- Jingrui He 4
- Tianxin Wei 4
- Hanghang Tong 3
- Yuanchen Bei 2
- Xiao Lin 2
- Zhining Liu 2
- Ruizhong Qiu 2
- Zhichen Zeng 2
- Yanjun Zhao 2
- Yutong Bai 1
- Ziyi Chen 1
- Lingjie Chen 1
- Chengming Cui 1
- Runpei Dong 1
- Haoran Geng 1
- Saurabh Gupta 1
- Hendrik Hamann 1
- Xialin He 1
- Peihao Li 1
- Zujie Liang 1
- Jitendra Malik 1
- Kai Mei 1
- Simmi Rana 1
- Yunxiao Shi 1
- Han Wang 1
- Kun Wang 1
- Wujiang Xu 1
- Min Xu 1
- Yuchen Yan 1
- Wendy H. Yang 1
- Qi Yu 1
- Junyu Zhang 1
- Huan Zhang 1
- Yongfeng Zhang 1
- Duo Zhou 1
- Yada Zhu 1
- Xi Zhu 1
- Jiaru Zou 1