Ying Shen

Other people with similar names: Ying Shen, Ying Shen

Unverified author pages with similar names: Ying Shen


2026

The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
Large Language Models (LLMs) have shown remarkable capabilities in Tool-Integrated Reasoning (TIR). However, the practical application is often hindered by frequent errors in tool invocations, such as incorrect parameters or malformed formats. Prevailing training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), can mitigate these issues but require modification on the base LLM. This lack of modularity necessitates extensive retraining when deploying the system across different base models. To address the limitation, we introduce the Invocation Refiner, a specialized post-processing module designed to enhance the tool-use reliability of base LLMs without directly training on them. The Refiner takes the output from a frozen upstream LLM and the user’s query as input, performing independent reasoning to rectify the invocation. We construct a dedicated training dataset and train this module using an advanced RL algorithm. On a diverse set of tool-use and reasoning benchmarks, our Refiner improves task completion rates and invocation accuracy over the raw outputs of various upstream LLMs. This highlights our Refiner as a plug-and-play solution for improving the operational reliability of LLM-based agents. We release our code to facilitate future research.
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as ”Deceitful” and ”Manipulative”, often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

2025

Graphical User Interface (GUI) interaction, which aims to develop an intelligent GUI agent that executes user instructions to perform tasks such as installing applications by controlling digital devices, has gained significant attention due to its practical value. Although current advanced multimodal large language models (LLMs) provide GUI agents with robust perception and reasoning capabilities, they often struggle with the precise localization of small elements. To tackle this problem, we propose InReAct, a multimodal GUI agent framework that unifies observing, thinking, and acting for precise and interpretable decision-making. It is trained via a two-stage process: curriculum learning to progressively build perception, grounding, and reasoning abilities, followed by reinforcement learning to refine pixel-level grounding with an outcome-based reward. We introduce a rule-based reward function that jointly optimizes action-type selection and pixel-level localization accuracy. Experimental results on multiple datasets demonstrate the superiority of InReAct in both grounding and navigation tasks.
Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignores the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.
Automating web navigation which aims to build a web agent that follows user instructions to complete tasks like booking flights by interacting with websites, has received increasing attention due to its practical value. Although existing web agents are mostly equipped with visual perception, planning, and memory abilities, their reasoning process are still deviate from human cognition. In this work, we study the human thought pattern to empower agent with more human-like abilities in web navigation. To tackle this problem, we propose a novel multimodal web agent framework called WebExperT, which is designed to emulate the human planning process of “thinking fast and slow” to effectively decompose complex user instructions. Furthermore, WebExperT leverages experiential learning by reflecting from failure for continuously refining planning and decision-making outcomes. Experimental results on the Mind2Web benchmark demonstrate the superiority of WebExperT in both supervised and unsupervised settings.
Bridging the gap between visual and language remains a pivotal challenge for the multimodal community. Traditional VQA benchmarks encounter a modality gap and over-reliance on language priors, whereas human cognition excels at intuitive semiosis, associating abstract visual symbols to linguistic semantics. Inspired by this neurocognitive mechanism, we focus on emojis, the visual cipher conveying abstract textual semantics. Specifically, we propose a novel task of generating abstract linguistics from emoji sequence images, where such reasoning underpins critical applications in cryptography, thus challenging MLLMs’ reasoning of decoding complex semantics of visual ciphers. We introduce eWe-bench (Express What you SeE), assessing MLLMs’ capability of intuitive semiosis like humans. Our data construction framework ensures high visual sensitivity and data quality, which can be extended to future data enhancement. Evaluation results on advanced MLLMs highlight critical deficiencies in visual intuitive symbolic reasoning. We believe our interesting insights for advancing visual semiosis in MLLMs will pave the way for cryptographic analysis and high-level intuitive cognition intelligence of MLLMs.
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in given sentences. Recently, multi-domain CSC has gradually attracted the attention of researchers because it is more practicable. In this paper, we focus on the key flaw of the CSC model when adapting to multi-domain scenarios: the tendency to forget previously acquired knowledge upon learning new domain-specific knowledge (i.e., catastrophic forgetting). To address this, we propose a novel model-agnostic Multi-stage Knowledge Transfer (MKT) framework with an evolving teacher model and dynamic distillation weights for knowledge transfer in each domain, rather than focusing solely on new domain knowledge. It deserves to be mentioned that we are the first to apply continual learning methods to the multi-domain CSC task. Experiments prove our method’s effectiveness over traditional approaches, highlighting the importance of overcoming catastrophic forgetting to enhance model performance.
The paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which received little attention in previous studies. To bridge the gap, we introduce **CLEME2.0**, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. They collectively contribute to exposing critical qualities and locating drawbacks of GEC systems. Evaluating systems by combining these aspects also leads to superior human consistency over other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, achieving a new state-of-the-art result. Our codes are released at https://github.com/THUKElab/CLEME.