Yuyao Ge
2026
Are All Prompt Components Value-Neutral? Understanding the Heterogeneous Adversarial Robustness of Dissected Prompt in LLMs
Yujia Zheng | Tianhao Li | Haotian Huang | Tianyu Zeng | Jingyu Lu | Chuangxin Chu | Yuekai Huang | Ziyou Jiang | Qian Xiong | Yuyao Ge | Mingyang Li
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Prompt-based adversarial attacks are a key tool for assessing the robustness of large language models (LLMs). Yet existing studies typically treat prompts as flat text and overlook their internal structure, even though different components within a prompt contribute unequally to robustness. This work introduces PromptAnatomy, a framework that decomposes prompts into functional components, and ComPerturb, a controlled perturbation method that selectively modifies these components to expose component-wise vulnerabilities while ensuring linguistic plausibility via perplexity-based filtering. Using this framework, four instruction-tuning datasets are structurally annotated and validated by human reviewers. Experiments across five advanced LLMs show that ComPerturb achieves state-of-the-art attack success rates, while ablation analyses confirm the complementary effects of prompt dissection and perplexity filtering. These results highlight the importance of structural awareness in evaluating and improving the adversarial robustness of LLMs.
a1: Steep Test-time Scaling Law via Environment Augmented Generation
Lingrui Mei | Shenghua Liu | Yiwei Wang | Baolong Bi | Yuyao Ge | Jun Wan | Yurong Wu | Xueqi Cheng
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have made remarkable breakthroughs in reasoning, yet continue to struggle with hallucinations, logical errors, and inability to self-correct during complex multi-step tasks. Current approaches like chain-of-thought prompting offer limited reasoning capabilities that fail when precise step validation is required. We propose Environment Augmented Generation (EAG), a framework that enhances LLM reasoning through: (1) real-time environmental feedback validating each reasoning step, (2) dynamic branch exploration for investigating alternative solution paths when faced with errors, and (3) experience-based learning from successful reasoning trajectories. Unlike existing methods, EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback with branching exploration. Our a1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks, matching larger models like o1 on competition mathematics while outperforming comparable models by up to 24.4 percentage points. Analysis reveals EAG’s distinctive scaling pattern: initial token investment in environment interaction yields substantial long-term performance dividends, with advantages amplifying proportionally to task complexity.
Lost in Decomposition: Analyzing and Mitigating the Limitations of Long Context Methods via Context Dependency
Jiayuan Guo | Yueyang Su | Yuyao Ge | Saiping Guan | Lei Yu | Jiafeng Guo | Xueqi Cheng
Findings of the Association for Computational Linguistics: ACL 2026
Long context large language models exhibit the “lost in the middle” problem, where models struggle to effectively utilize information located in the middle of long contexts. Although existing workflow-based long context methods (e.g., RAG) alleviate this problem and perform well on specific datasets, can their effectiveness generalize to all types of datasets? In this work, we systematically investigate the cross-dataset generalization of long context methods. Our evaluation reveals that these methods are not universally effective. Such substantial performance variability underscores the risk of performance degradation when long context methods are applied indiscriminately. We investigate why long context methods fail and find that their intrinsic decomposition mechanisms hinder context dependency modeling, causing performance declines on documents with strong context dependency. To address this issue, we propose CoDaR (**Co**ntext **D**ependency-**a**ware **R**outing), a training-free adaptive routing strategy. By analyzing the context dependency strength of documents, CoDaR adaptively invokes long context methods, thereby significantly enhancing their overall robustness across different types of datasets.
Gated Differentiable Working Memory for Long-Context Language Modeling
Lingrui Mei | Shenghua Liu | Yiwei Wang | Yuyao Ge | Baolong Bi | Jiayu Yao | Jun Wan | Ziling Yin | Jiafeng Guo | Xueqi Cheng
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long contexts break transformers: attention scores dilute across thousands of tokens, critical information gets lost in the middle, and the model cannot adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory (transient parameters updated on the current context), but existing approaches employ uniform write policies that waste computation on low-value regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, asking: given limited computational budget, which parts of the context should be consolidated into working memory? We propose GDWM (Gated Differentiable Working Memory), a framework that introduces a Write Controller to gate the memory consolidation process. Our controller estimates Contextual Utility, an information-theoretic measure quantifying how much each region depends on long-range context, and allocates gradient steps accordingly, subject to a coverage constraint that ensures global representation. Theoretically, we prove that our chunk-restricted sampling strategy reduces gradient variance by eliminating inter-chunk variance via the Law of Total Variance. Experiments on ZeroSCROLLS and LongBench v2 benchmarks demonstrate that GDWM achieves comparable or superior performance with 4× fewer gradient steps compared to uniform baselines, excelling on sparse-information tasks (+6–13% on Qasper, +5–13% on GovReport for smaller models) while revealing principled trade-offs on dense-coverage tasks, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
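The variance claim in this abstract rests on the Law of Total Variance. As a rough sketch, let g denote a stochastic gradient estimate and C the index of the chunk a training sample is drawn from; both symbols are assumptions made here for illustration and are not taken from the paper.

```latex
\[
\operatorname{Var}(g)
  = \underbrace{\mathbb{E}_{C}\!\left[\operatorname{Var}(g \mid C)\right]}_{\text{intra-chunk variance}}
  + \underbrace{\operatorname{Var}_{C}\!\left(\mathbb{E}[g \mid C]\right)}_{\text{inter-chunk variance}}
\]
```

Under this reading, restricting sampling to a single chunk fixes C, so the second term is zero and only the intra-chunk term remains. How the paper formalizes the sampling distribution is not stated in the abstract.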
2025
Can Graph Descriptive Order Affect Solving Graph Problems with LLMs?
Yuyao Ge | Shenghua Liu | Baolong Bi | Yiwei Wang | Lingrui Mei | Wenjie Feng | Lizhe Chen | Xueqi Cheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved significant success in reasoning tasks, including mathematical reasoning and logical deduction. Among these reasoning tasks, graph problems stand out due to their complexity and unique structural characteristics, attracting considerable attention from researchers. Previous studies have explored LLMs’ graph reasoning abilities through various techniques, such as different encoding methods for graph structures and the use of carefully designed prompts. However, a critical factor has been mostly overlooked: the sequential order in which graph descriptions are presented to the models in the prompt. In this study, we present the first comprehensive analysis of how the order of graph descriptions impacts LLM performance. Specifically, we evaluate four graph description orders across six graph problems using six mainstream LLMs. The results reveal that: (1) ordered graph descriptions significantly improve LLMs’ comprehension of graph structures; (2) the robustness of LLMs to graph description order varies across different tasks; and (3) the impact of graph order on performance is closely related to the inherent characteristics of tasks. This study provides a critical advancement in the application of LLMs for solving graph-related problems, paving the way for future research to optimize model performance through strategic graph description ordering.
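To make the idea of a "graph description order" concrete, the sketch below lists the same edges in two ways: in the order a breadth-first traversal discovers them, and in a shuffled order. The abstract does not name the four orders studied, so the choice of BFS, the helper names `bfs_edge_order` and `describe`, and the sentence template are all assumptions for illustration.

```python
from collections import deque
import random


def bfs_edge_order(edges, start):
    """Return undirected edges in the order a BFS from `start` first meets them.

    Only edges reachable from `start` are returned.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    visited = {start}
    seen_edges = set()
    ordered = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            key = frozenset((node, neighbor))
            if key not in seen_edges:
                seen_edges.add(key)
                ordered.append((node, neighbor))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return ordered


def describe(edges):
    """Render an edge list as plain-text sentences (hypothetical template)."""
    return " ".join(f"Node {u} is connected to node {v}." for u, v in edges)


# The same graph, described in two different orders.
edges = [(2, 3), (0, 1), (1, 2), (0, 2)]
bfs_edges = bfs_edge_order(edges, start=0)
random_edges = random.Random(0).sample(edges, k=len(edges))

print("BFS order:   ", describe(bfs_edges))
print("Random order:", describe(random_edges))
```

Both descriptions encode an identical graph; only the order of the sentences differs, which is the variable the study isolates.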
Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation
Jiayu Yao | Shenghua Liu | Yiwei Wang | Lingrui Mei | Baolong Bi | Yuyao Ge | Zhecheng Li | Xueqi Cheng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal Retrieval-Augmented Generation (RAG) systems have become essential in knowledge-intensive and open-domain tasks. As retrieval complexity increases, ensuring the robustness of these systems is critical. However, current RAG models are highly sensitive to the order in which evidence is presented, often resulting in unstable performance and biased reasoning, particularly as the number of retrieved items or modality diversity grows. This raises a central question: How does the position of retrieved evidence affect multimodal RAG performance? To answer this, we present the first comprehensive study of position bias in multimodal RAG systems. Through controlled experiments across text-only, image-only, and mixed-modality tasks, we observe a consistent U-shaped accuracy curve with respect to evidence position. To quantify this bias, we introduce the Position Sensitivity Index (PSIp) and develop a visualization framework to trace attention allocation patterns across decoder layers. Our results reveal that multimodal interactions intensify position bias compared to unimodal settings, and that this bias increases logarithmically with retrieval range. These findings offer both theoretical and empirical foundations for position-aware analysis in RAG, highlighting the need for evidence reordering or debiasing strategies to build more reliable and equitable generation systems. Our code and experimental resources are available at https://github.com/Theodyy/Multimodal-Rag-Position-Bias.