Xinyu Yang

2026

Code agents are increasingly trusted to autonomously fix bugs on platforms such as GitHub, yet their security evaluation focuses almost exclusively on functional correctness. In this paper, we reveal a novel type of threat to real-world code-agents: functionally correct yet vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. With our proposed FCV-Attack, we demonstrate that SOTA LLMs (e.g., ChatGPT and Claude) and agent scaffolds (e.g., SWE-agent and OpenHands) are all vulnerable to this FCV threat; across 12 agent-model combinations on SWE-Bench, the attack only requires black-box access and a single query to the code agent to perform the attack. For example, for CWE-538 (information exposure vulnerability), the FCV-Attack attains an attack success rate of 40.7% on GPT-5 Mini + OpenHands. Our results reveal an important security threat overlooked by current evaluation paradigms and urge the development of security-aware defenses for code agents.

pdf bib abs

Large language models (LLMs) have demonstrated strong performance and rapid progress in a wide range of medical reasoning tasks.However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems.To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri Net theory.The framework adopts a full-stack design across data, model architecture, and system execution.For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge-grounded medical reasoning path and transforms them into Petri Net–structured representations.At the architectural level, we propose a topology-aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency.Systematically, we develop a customized inference engine that supports parallel execution without additional overhead.Empirical evaluations show that MedVerse improves strong general-purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance with improved clinical reliability, while delivering a 1.3× reduction in inference latency and a 1.7× increase in generation throughput, enabled by its parallel decoding capability.

pdf bib abs

GLARE: Agentic Reasoning for Legal Judgment Prediction
Xinyu Yang | Chenlong Deng | Zhicheng Dou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Legal judgment prediction serves as a pivotal task in intelligent judicial systems. Although large language models have achieved remarkable progress in general reasoning, they struggle with tasks that require fine-grained distinctions between similar charges. These models often select plausible charges directly without discriminating among closely related alternatives. In this paper, we introduce GLARE, an agentic legal reasoning framework that enables models to actively retrieve and apply external knowledge during decision-making. Unlike static prediction, GLARE simulates comparative reasoning by dynamically expanding the decision space to include confusing candidates, then retrieving exclusionary logic from precedents and statutes to identify the correct judgment. Experiments on real-world datasets show that our method significantly outperforms strong baselines, especially on complex cases involving confusing or rare charges. The code is available at https://anonymous.4open.science/r/GLARE-LJP-8EDF.

pdf bib abs

Although the Universal Transformer (UT) mitigates the diminishing returns of standard LLM scaling by decoupling parameter count from depth, it remains constrained by linear computational costs and rigid weight-sharing mechanisms. These limitations lead to severe functional homogeneity, which subsequently induces over-smoothing, representation rank collapse, and degraded reasoning performance. In this work, we present the first systematic study of Compute Distribution Skew, identifying it as the primary driver of extrapolation failure. This is a pathological phenomenon in ultra-deep recurrent Transformers characterized by a disproportionate distribution of contributions across recurrent steps, resulting in distinct functional states during prefix and suffix processing phases. To address this challenge, we propose the Polymorphic Transformer, which aims to achieve functional polymorphism and depth sparsity within a shared-parameter framework. By integrating conditional sparse subspaces, SiLU Attention, and an uncertainty-aware depth scheduler, our architecture mitigates power-method collapse and effectively decouples logical depth from computational cost. Experiments demonstrate that our model significantly enhances representation rank and robustness, achieving complex reasoning performance comparable to baseline while reducing computation by 64.7%.

2025

pdf bib abs

System Report for CCL25-Eval Task 4: Prompting, Scheduling, and Arbitration Strategies for Chinese Factivity Inference
Liu Daohuan | Xia Lun | Yuxuan Zhang | Xinyu Yang | Fanzhen Kong
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

This report presents the methodology and findings of prompting large language models (LLMs) for Chinese Factivity Inference (FI). We evaluated five LLMs, among which DeepSeek-R1 demonstrated the best overall performance. A combination of Chain-of-Thought (CoT), few-shot, and system-level instructions were combined for final prompting. Additionally, we introduced a pairwise task scheduling strategy and a multi-agent disagreement arbitration mechanism to further enhance inference quality. Experimental results show that the integration of prompting, scheduling, and arbitration strategies significantly improves performance, with DeepSeek-R1 achieving 91.7% overall accuracy on the evaluation set. The report also highlights findings regarding LLM behavior on FI tasks and outlines potential directions for future improvement.

pdf bib abs

Debiasing the Fine-Grained Classification Task in LLMs with Bias-Aware PEFT
Daiying Zhao | Xinyu Yang | Hang Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fine-grained classification via LLMs is susceptible to more complex label biases compared to traditional classification tasks. Existing bias mitigation strategies, such as retraining, post-hoc adjustment, and parameter-efficient fine-tuning (PEFT) are primarily effective for simple classification biases, such as stereotypes, but fail to adequately address prediction propensity and discriminative ability biases. In this paper, we analyze these two bias phenomena and observe their progressive accumulation from intermediate to deeper layers within LLMs. To mitigate this issue, we propose a bias-aware optimization framework that incorporates two distinct label balance constraints with a PEFT strategy targeting an intermediate layer. Our approach adjusts less than 1% of the model’s parameters while effectively curbing bias amplification in deeper layers. Extensive experiments conducted across 12 datasets and 5 LLMs demonstrate that our method consistently outperforms or matches the performance of full-parameter fine-tuning and LoRA, achieving superior results with lower perplexity.

pdf bib abs

Quantifying Semantic Emergence in Language Models
Hang Chen | Xinyu Yang | Jiaying Zhu | Wenya Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) are widely recognized for their exceptional capacity to capture semantics meaning. Yet, there remains no established metric to quantify this capability. In this work, we introduce a quantitative metric, Information Emergence (IE), designed to measure LLMs’ ability to extract semantics from input tokens. We formalize “semantics” as the meaningful information abstracted from a sequence of tokens and quantify this by comparing the entropy reduction observed for a sequence of tokens (macro-level) and individual tokens (micro-level). To achieve this, we design a lightweight estimator to compute the mutual information at each transformer layer, which is agnostic to different tasks and language model architectures. We apply IE in both synthetic in-context learning (ICL) scenarios and natural sentence contexts. Experiments demonstrate informativeness and patterns about semantics. While some of these patterns confirm the conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights.

2023

pdf bib abs

How to Enhance Causal Discrimination of Utterances: A Case on Affective Reasoning
Hang Chen | Xinyu Yang | Jing Luo | Wenjing Zhu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Our investigation into the Affective Reasoning in Conversation (ARC) task highlights the challenge of causal discrimination. Almost all existing models, including large language models (LLMs), excel at capturing semantic correlations within utterance embeddings but fall short in determining the specific causal relationships. To overcome this limitation, we propose the incorporation of i.i.d. noise terms into the conversation process, thereby constructing a structural causal model (SCM). It explores how distinct causal relationships of fitted embeddings can be discerned through independent conditions. To facilitate the implementation of deep learning, we introduce the cogn frameworks to handle unstructured conversation data, and employ an autoencoder architecture to regard the unobservable noise as learnable “implicit causes.” Moreover, we curate a synthetic dataset that includes i.i.d. noise. Through comprehensive experiments, we validate the effectiveness and interpretability of our approach. Our code is available in https://github.com/Zodiark-ch/mater-of-our-EMNLP2023-paper.