Ronghao Chen
2026
Tiny Scales, Great Challenges: The Limits of Multimodal LLMs in Scale Recognition
Jihang Jin | Ronghao Chen | Hao Zhang | Ziyan Liu | Huacan Wang | Qi Ye | Jingping Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Visual scale recognition is fundamental to how humans perceive physical quantities in the real world, and it is crucial for enabling human-like intelligence in multimodal large language models (MLLMs). However, existing benchmarks typically focus on a single type of quantity (e.g., time) or a specific format (e.g., dials) and thus lack a comprehensive evaluation of scale recognition capabilities. To address these problems, we propose ScaleBench, a visual scale recognition benchmark built using images from COCO, Open Images, and Flickr, designed to comprehensively evaluate the scale recognition capabilities of MLLMs. To ensure high data quality, we develop detailed annotation guidelines and procedures, resulting in a total of 6,574 annotated samples. Based on this benchmark, we evaluate multiple closed-source and open-source MLLMs. Experimental results reveal that the best-performing model achieves only 42.60% accuracy, far lower than the 97.40% of humans. Furthermore, we conduct in-depth experimental analyses and provide future research directions. Our benchmark and implementation code are available at https://github.com/Sonder-hang/ScaleBench.
CloneMem: Benchmarking Long-Term Memory for AI Clones
Sen Hu | Zhiyu Zhang | Yuxiang Wei | Xueran Han | Zhenheng Tang | Ronghao Chen | Huacan Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
AI Clones aim to simulate an individual’s thoughts and behaviors to enable long-term, personalized interaction, placing stringent demands on memory systems to model experiences, emotions, and opinions over time. Existing memory benchmarks primarily rely on user–agent conversational histories, which are temporally fragmented and insufficient for capturing continuous life trajectories. We introduce CloneMem, a benchmark for evaluating long-term memory in AI Clone scenarios grounded in non-conversational digital traces, including diaries, social media posts, and emails, spanning one to three years. CloneMem adopts a top-down data construction framework to ensure longitudinal coherence and defines tasks that assess an agent’s ability to track evolving personal states. Experiments show that current memory mechanisms struggle in this setting, highlighting open challenges for life-grounded personalized AI. Code and dataset are available at https://github.com/AvatarMemory/CloneMemBench
LiveCANNBench: Benchmark SWE AI Coding for Ascend CANN
Sijie Wang | Kai Zhao | Wee Peng Tay | Shuo Zhang | Chengwen Liu | Quanjiang Guo | Ren Junhao | Xin Li | Heng Lian | Jingdi Lei | Rui She | Huacan Wang | Ronghao Chen
Findings of the Association for Computational Linguistics: ACL 2026
AI coding has emerged as a core application of large language models (LLMs), evolving from single-file coding tasks towards complex software engineering (SWE) scenarios. Recent advances in agents have enabled multi-file, multi-language, and dependency-aware AI coding, significantly expanding the scope of AI-assisted software development. While a variety of benchmarks have been proposed to evaluate coding capabilities in general-purpose or GPU coding ecosystems such as CUDA and ROCm, systematic evaluation for Huawei Ascend CANN remains largely underexplored. In this work, we propose LiveCANNBench, an SWE-level benchmark designed for AI coding in the CANN software stack. LiveCANNBench is constructed from real-world CANN repositories and consists of over 400 task instances spanning multi-file, multi-language, and execution-aware coding challenges. Unlike existing static benchmarks that primarily focus on kernel-level code generation, LiveCANNBench adopts a live benchmarking paradigm, effectively mitigating data leakage and enabling more reliable evaluation of modern coding agents.
KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
Tingyu Wu | Zhisheng Chen | Ziyan Weng | Shuhe Wang | Shuo Zhang | Sen Hu | Silin Wu | Qizhen Lan | Huacan Wang | Ronghao Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present KnowMe-Bench, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. KnowMe-Bench reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval.
MirrorQA: Benchmarking Multimodal LLMs on Mirror-Orientation Reasoning
Jingping Liu | Xingchen Peng | Yan Zhou | Ziyan Liu | Jie Zhai | Ronghao Chen | Huacan Wang | Xiaofeng Jia
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have achieved remarkable progress in recent years, yet their ability to perform left–right reasoning in mirror contexts, a fundamental element of spatial cognition, remains underexplored. To address this gap, we introduce MirrorQA, a manually constructed benchmark with 5,549 samples, designed to evaluate MLLMs’ capability to distinguish left from right from a subject-centered perspective. MirrorQA is built through a three-stage pipeline (annotation, verification, and final review) to ensure high-quality labeling. Comprehensive evaluations on both open- and closed-source MLLMs show that even the best-performing models achieve only 65.40% accuracy, far below the 99.28% accuracy of humans. These results highlight substantial challenges in current MLLMs when reasoning about left and right, and point to promising directions for future research. MirrorQA and its code are publicly available at https://github.com/stargazer-zeno/MirrorQA.
Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory
Sen Hu | Yuxiang Wei | Jiaxin Ran | Xueran Han | Zhiyuan Yao | Huacan Wang | Ronghao Chen | Lei Zou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
Haonan Bian | Zhiyuan Yao | Sen Hu | Zishan Xu | Shaolei Zhang | Yifu Guo | Ziliang Yang | Xueran Han | Huacan Wang | Ronghao Chen
Findings of the Association for Computational Linguistics: ACL 2026
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture “long-term project-oriented” interactions where agents must track evolving goals. To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects. Our code and datasets are available at https://anonymous.4open.science/r/realmem-A1E4.
2025
ALRPHFS: Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning for Robust Agent Defense
Shiyu Xiang | Tong Zhang | Ronghao Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
LLM agents are becoming central to intelligent systems. However, their deployment raises serious safety concerns. Existing defenses largely rely on “Safety Checks”, which struggle to capture the complex semantic risks posed by harmful user inputs or unsafe agent behaviors, creating a significant semantic gap between safety checks and real-world risks. To bridge this gap, we propose a novel defense framework, ALRPHFS (Adversarially Learned Risk Patterns with Hierarchical Fast & Slow Reasoning). ALRPHFS consists of two core components: (1) an offline adversarial self-learning loop to iteratively refine a generalizable and balanced library of risk patterns, substantially enhancing robustness without retraining the base LLM, and (2) an online hierarchical fast & slow reasoning engine that balances detection effectiveness with computational efficiency. Experimental results demonstrate that our approach outperforms existing baselines overall, achieving a best-in-class average accuracy of 80% and exhibiting strong generalizability across agents and tasks.
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Shiyu Xiang | Ansen Zhang | Yanfei Cao | Fan Yang | Ronghao Chen
Findings of the Association for Computational Linguistics: ACL 2025
Although Aligned Large Language Models (LLMs) are trained to reject harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying “attack essences” remain the same. To address this issue, we introduce EDDF, an Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the “attack essence” from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
Co-authors
- Huacan Wang 7
- Sen Hu 4
- Xueran Han 3
- Ziyan Liu 2
- Jingping Liu 2
- Yuxiang Wei 2
- Shiyu Xiang 2
- Zhiyuan Yao 2
- Shuo Zhang 2
- Haonan Bian 1
- Yanfei Cao 1
- Zhisheng Chen 1
- Quanjiang Guo 1
- Yifu Guo 1
- Xiaofeng Jia 1
- Jihang Jin 1
- Ren Junhao 1
- Qizhen Lan 1
- Jingdi Lei 1
- Xin Li 1
- Heng Lian 1
- Chengwen Liu 1
- Xingchen Peng 1
- Jiaxin Ran 1
- Rui She 1
- Zhenheng Tang 1
- Wee Peng Tay 1
- Sijie Wang 1
- Shuhe Wang 1
- Ziyan Weng 1
- Tingyu Wu 1
- Silin Wu 1
- Zishan Xu 1
- Fan Yang 1
- Ziliang Yang 1
- Qi Ye 1
- Jie Zhai 1
- Hao Zhang 1
- Tong Zhang 1
- Zhiyu Zhang 1
- Ansen Zhang 1
- Shaolei Zhang 1
- Kai Zhao 1
- Yan Zhou 1
- Lei Zou 1