Huacan Wang
2026
Tiny Scales, Great Challenges: The Limits of Multimodal LLMs in Scale Recognition
Jihang Jin | Ronghao Chen | Hao Zhang | Ziyan Liu | Huacan Wang | Qi Ye | Jingping Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Visual scale recognition is fundamental to how humans perceive physical quantities in the real world, and it is crucial for enabling human-like intelligence in multimodal large language models (MLLMs). However, existing benchmarks typically focus on a single type of quantity (e.g., time) or a specific format (e.g., dials), and so lack a comprehensive evaluation of scale recognition capabilities. To address these problems, we propose ScaleBench, a visual scale recognition benchmark built using images from COCO, Open Images, and Flickr, designed to comprehensively evaluate the scale recognition capabilities of MLLMs. To ensure high data quality, we develop detailed annotation guidelines and procedures, resulting in a total of 6,574 annotated samples. Based on this benchmark, we evaluate multiple closed-source and open-source MLLMs. Experimental results reveal that the best-performing model achieves only 42.60% accuracy, far lower than the 97.40% of humans. Furthermore, we conduct in-depth experimental analyses and provide future research directions. Our benchmark and implementation code are available at https://github.com/Sonder-hang/ScaleBench.
SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory
Hao Wang | Ziyi Ni | Huacan Wang | Pin Lyu | Lei Sha
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current defenses for Large Language Models (LLMs) often suffer from a “memory gap”: parameter-modifying methods are computationally rigid, while inference-time filters cannot retain or reuse defense knowledge across interactions. To address this, we propose SafetyMem, a novel framework that secures LLMs through a dual-component safety memory system. SafetyMem consists of Semantic Safety Memory (SSM), which consolidates diverse jailbreak attempts into a structured knowledge base of attack patterns, and Episodic Safety Memory (ESM), which maintains an evolving set of procedural rules refined from historical detection failures. Unlike static defenses, SafetyMem allows the model to “remember” and adapt to emerging adversarial strategies without parameter retraining. To further enhance robustness, we introduce an adversarial memory expansion mechanism that proactively generates challenging variants to solidify these memories. Experiments on standard and stealthy jailbreak benchmarks show that SafetyMem substantially reduces attack success rates while preserving efficiency and interpretability, consistently outperforming state-of-the-art baselines across multiple LLMs.
CloneMem: Benchmarking Long-Term Memory for AI Clones
Sen Hu | Zhiyu Zhang | Yuxiang Wei | Xueran Han | Zhenheng Tang | Ronghao Chen | Huacan Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
AI Clones aim to simulate an individual’s thoughts and behaviors to enable long-term, personalized interaction, placing stringent demands on memory systems to model experiences, emotions, and opinions over time. Existing memory benchmarks primarily rely on user–agent conversational histories, which are temporally fragmented and insufficient for capturing continuous life trajectories. We introduce CloneMem, a benchmark for evaluating long-term memory in AI Clone scenarios grounded in non-conversational digital traces, including diaries, social media posts, and emails, spanning one to three years. CloneMem adopts a top-down data construction framework to ensure longitudinal coherence and defines tasks that assess an agent’s ability to track evolving personal states. Experiments show that current memory mechanisms struggle in this setting, highlighting open challenges for life-grounded personalized AI. Code and dataset are available at https://github.com/AvatarMemory/CloneMemBench
LiveCANNBench: Benchmark SWE AI Coding for Ascend CANN
Sijie Wang | Kai Zhao | Wee Peng Tay | Shuo Zhang | Chengwen Liu | Quanjiang Guo | Ren Junhao | Xin Li | Heng Lian | Jingdi Lei | Rui She | Huacan Wang | Ronghao Chen
Findings of the Association for Computational Linguistics: ACL 2026
AI coding has emerged as a core application of large language models (LLMs), evolving from single-file coding tasks towards complex software engineering (SWE) scenarios. Recent advances in agents have enabled multi-file, multi-language, and dependency-aware AI coding, significantly expanding the scope of AI-assisted software development. While a variety of benchmarks have been proposed to evaluate coding capabilities in general-purpose or GPU coding ecosystems such as CUDA and ROCm, systematic evaluation for Huawei Ascend CANN remains largely underexplored. In this work, we propose LiveCANNBench, an SWE-level benchmark designed for AI coding in the CANN software stack. LiveCANNBench is constructed from real-world CANN repositories and consists of over 400 task instances spanning multi-file, multi-language, and execution-aware coding challenges. Unlike existing static benchmarks that primarily focus on kernel-level code generation, LiveCANNBench adopts a live benchmarking paradigm, effectively mitigating data leakage and enabling more reliable evaluation of modern coding agents.
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
Tingyu Wu | Zhisheng Chen | Ziyan Weng | Shuhe Wang | Shuo Zhang | Sen Hu | Silin Wu | Qizhen Lan | Huacan Wang | Ronghao Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval.
MirrorQA: Benchmarking Multimodal LLMs on Mirror-Orientation Reasoning
Jingping Liu | Xingchen Peng | Yan Zhou | Ziyan Liu | Jie Zhai | Ronghao Chen | Huacan Wang | Xiaofeng Jia
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have achieved remarkable progress in recent years, yet their ability to perform left–right reasoning in mirror contexts, a fundamental element of spatial cognition, remains underexplored. To address this gap, we introduce MirrorQA, a manually constructed benchmark with 5,549 samples, designed to evaluate MLLMs’ capability to distinguish left from right from a subject-centered perspective. MirrorQA is built through a three-stage pipeline (annotation, verification, and final review) to ensure high-quality labeling. Comprehensive evaluations on both open- and closed-source MLLMs show that even the best-performing models achieve only 65.40% accuracy, far below the 99.28% accuracy of humans. These results highlight substantial challenges in current MLLMs when reasoning about left and right, and point to promising directions for future research. MirrorQA and its code are publicly available at https://github.com/stargazer-zeno/MirrorQA.
Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory
Sen Hu | Yuxiang Wei | Jiaxin Ran | Xueran Han | Zhiyuan Yao | Huacan Wang | Ronghao Chen | Lei Zou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
Haonan Bian | Zhiyuan Yao | Sen Hu | Zishan Xu | Shaolei Zhang | Yifu Guo | Ziliang Yang | Xueran Han | Huacan Wang | Ronghao Chen
Findings of the Association for Computational Linguistics: ACL 2026
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture “long-term project-oriented” interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at https://anonymous.4open.science/r/realmem-A1E4.
Co-authors
- Ronghao Chen 7
- Sen Hu 4
- Xueran Han 3
- Ziyan Liu 2
- Jingping Liu 2
- Yuxiang Wei 2
- Zhiyuan Yao 2
- Shuo Zhang 2
- Haonan Bian 1
- Zhisheng Chen 1
- Quanjiang Guo 1
- Yifu Guo 1
- Xiaofeng Jia 1
- Jihang Jin 1
- Ren Junhao 1
- Qizhen Lan 1
- Jingdi Lei 1
- Xin Li 1
- Heng Lian 1
- Chengwen Liu 1
- Pin Lyu 1
- Ziyi Ni 1
- Xingchen Peng 1
- Jiaxin Ran 1
- Lei Sha 1
- Rui She 1
- Zhenheng Tang 1
- Wee Peng Tay 1
- Hao Wang 1
- Sijie Wang 1
- Shuhe Wang 1
- Ziyan Weng 1
- Tingyu Wu 1
- Silin Wu 1
- Zishan Xu 1
- Ziliang Yang 1
- Qi Ye 1
- Jie Zhai 1
- Hao Zhang 1
- Zhiyu Zhang 1
- Shaolei Zhang 1
- Kai Zhao 1
- Yan Zhou 1
- Lei Zou 1