Ying Shen
2026
Attention Basin: Why Contextual Position Matters in Large Language Models
Zihao Yi | Zhenqing Ling | Delong Zeng | Haohao Luo | Zhe Xu | Wei Liu | Jian Luan | Wanxia Cao | Ying Shen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
Invocation Refiner: A Plug-and-Play Module for Rectifying LLM Tool Invocations
Qirui Jiao | Dian Jiao | Nan Du | Ying Shen | Liang Lin
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have shown remarkable capabilities in Tool-Integrated Reasoning (TIR). However, their practical application is often hindered by frequent errors in tool invocations, such as incorrect parameters or malformed formats. Prevailing training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), can mitigate these issues but require modifying the base LLM. This lack of modularity necessitates extensive retraining when deploying the system across different base models. To address this limitation, we introduce the Invocation Refiner, a specialized post-processing module designed to enhance the tool-use reliability of base LLMs without directly training on them. The Refiner takes the output from a frozen upstream LLM and the user’s query as input, performing independent reasoning to rectify the invocation. We construct a dedicated training dataset and train this module using an advanced RL algorithm. On a diverse set of tool-use and reasoning benchmarks, our Refiner improves task completion rates and invocation accuracy over the raw outputs of various upstream LLMs. This highlights our Refiner as a plug-and-play solution for improving the operational reliability of LLM-based agents. We release our code to facilitate future research.
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Zihao Yi | Qingxuan Jiang | Ruotian Ma | Xingyu Chen | Qu Yang | Mengru Wang | Fanghua Ye | Ying Shen | Zhaopeng Tu | Xiaolong Li | Liefeng Bo
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as “Deceitful” and “Manipulative”, often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
2025
INREACT: An Inspire-Then-Reinforce Training Framework For Multimodal GUI Agent
Yuanlei Wang | Liuzhou Zhang | Haohao Luo | Ying Shen
Findings of the Association for Computational Linguistics: EMNLP 2025
Graphical User Interface (GUI) interaction, which aims to develop an intelligent GUI agent that executes user instructions to perform tasks such as installing applications by controlling digital devices, has gained significant attention due to its practical value. Although current advanced multimodal large language models (LLMs) provide GUI agents with robust perception and reasoning capabilities, they often struggle with the precise localization of small elements. To tackle this problem, we propose InReAct, a multimodal GUI agent framework that unifies observing, thinking, and acting for precise and interpretable decision-making. It is trained via a two-stage process: curriculum learning to progressively build perception, grounding, and reasoning abilities, followed by reinforcement learning to refine pixel-level grounding with an outcome-based reward. We introduce a rule-based reward function that jointly optimizes action-type selection and pixel-level localization accuracy. Experimental results on multiple datasets demonstrate the superiority of InReAct in both grounding and navigation tasks.
Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment
Delong Zeng | Yuexiang Xie | Yaliang Li | Ying Shen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignore the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.
Browsing Like Human: A Multimodal Web Agent with Experiential Fast-and-Slow Thinking
Haohao Luo | Jiayi Kuang | Wei Liu | Ying Shen | Jian Luan | Yang Deng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automating web navigation, which aims to build a web agent that follows user instructions to complete tasks like booking flights by interacting with websites, has received increasing attention due to its practical value. Although existing web agents are mostly equipped with visual perception, planning, and memory abilities, their reasoning processes still deviate from human cognition. In this work, we study the human thought pattern to empower agents with more human-like abilities in web navigation. To tackle this problem, we propose a novel multimodal web agent framework called WebExperT, which is designed to emulate the human planning process of “thinking fast and slow” to effectively decompose complex user instructions. Furthermore, WebExperT leverages experiential learning by reflecting on failures to continuously refine planning and decision-making outcomes. Experimental results on the Mind2Web benchmark demonstrate the superiority of WebExperT in both supervised and unsupervised settings.
Express What You See: Can Multimodal LLMs Decode Visual Ciphers with Intuitive Semiosis Comprehension?
Jiayi Kuang | Yinghui Li | Chen Wang | Haohao Luo | Ying Shen | Wenhao Jiang
Findings of the Association for Computational Linguistics: ACL 2025
Bridging the gap between vision and language remains a pivotal challenge for the multimodal community. Traditional VQA benchmarks encounter a modality gap and over-reliance on language priors, whereas human cognition excels at intuitive semiosis, associating abstract visual symbols with linguistic semantics. Inspired by this neurocognitive mechanism, we focus on emojis, visual ciphers that convey abstract textual semantics. Specifically, we propose a novel task of generating abstract linguistics from emoji sequence images, where such reasoning underpins critical applications in cryptography, thus challenging MLLMs’ ability to decode the complex semantics of visual ciphers. We introduce eWe-bench (Express What you SeE), which assesses MLLMs’ capability for human-like intuitive semiosis. Our data construction framework ensures high visual sensitivity and data quality, and can be extended to future data enhancement. Evaluation results on advanced MLLMs highlight critical deficiencies in visual intuitive symbolic reasoning. We believe our insights for advancing visual semiosis in MLLMs will pave the way for cryptographic analysis and high-level intuitive cognition intelligence of MLLMs.
MKT: A Multi-Stage Knowledge Transfer Framework to Mitigate Catastrophic Forgetting in Multi-Domain Chinese Spelling Correction
Peng Xing | Yinghui Li | Shirong Ma | Xinnian Liang | Haojing Huang | Yangning Li | Shu-Yu Guo | Hai-Tao Zheng | Wenhao Jiang | Ying Shen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in given sentences. Recently, multi-domain CSC has gradually attracted the attention of researchers because it is more practical. In this paper, we focus on the key flaw of CSC models when adapting to multi-domain scenarios: the tendency to forget previously acquired knowledge upon learning new domain-specific knowledge (i.e., catastrophic forgetting). To address this, we propose a novel model-agnostic Multi-stage Knowledge Transfer (MKT) framework with an evolving teacher model and dynamic distillation weights for knowledge transfer in each domain, rather than focusing solely on new domain knowledge. Notably, we are the first to apply continual learning methods to the multi-domain CSC task. Experiments demonstrate our method’s effectiveness over traditional approaches, highlighting the importance of overcoming catastrophic forgetting to enhance model performance.
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
Jingheng Ye | Zishan Xu | Yinghui Li | Linlin Song | Qingyu Zhou | Hai-Tao Zheng | Ying Shen | Wenhao Jiang | Hong-Gee Kim | Ruitong Liu | Xin Su | Zifei Shan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which has received little attention in previous studies. To bridge the gap, we introduce **CLEME2.0**, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. They collectively contribute to exposing critical qualities and locating drawbacks of GEC systems. Evaluating systems by combining these aspects also leads to superior human consistency over other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, achieving a new state-of-the-art result. Our code is released at https://github.com/THUKElab/CLEME.
Co-authors
- Haohao Luo 4
- Wenhao Jiang 3
- Yinghui Li 3
- Jiayi Kuang 2
- Wei Liu 2
- Jian Luan 2
- Zihao Yi 2
- Delong Zeng 2
- Hai-Tao Zheng 2
- Liefeng Bo 1
- Wanxia Cao 1
- Xingyu Chen 1
- Yang Deng 1
- Nan Du 1
- Shu-Yu Guo 1
- Haojing Huang 1
- Qingxuan Jiang 1
- Qirui Jiao 1
- Dian Jiao 1
- Hong-Gee Kim 1
- Yaliang Li 1
- Yangning Li 1
- Xiaolong Li 1
- Xinnian Liang 1
- Liang Lin 1
- Zhenqing Ling 1
- Ruitong Liu 1
- Shirong Ma 1
- Ruotian Ma 1
- Zifei Shan 1
- Linlin Song 1
- Xin Su 1
- Zhaopeng Tu 1
- Yuanlei Wang 1
- Chen Wang 1
- Mengru Wang 1
- Yuexiang Xie 1
- Peng Xing 1
- Zhe Xu 1
- Zishan Xu 1
- Qu Yang 1
- Jingheng Ye 1
- Fanghua Ye 1
- Liuzhou Zhang 1
- Qingyu Zhou 1