Xinyu Yang
2026
When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?
Yibo Peng | James Song | Lei Li | Xinyu Yang | Mihai Christodorescu | Ravi Mangal | Corina S. Pasareanu | Haizhong Zheng | Beidi Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code agents are increasingly trusted to autonomously fix bugs on platforms such as GitHub, yet their security evaluation focuses almost exclusively on functional correctness. In this paper, we reveal a novel type of threat to real-world code agents: functionally correct yet vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. With our proposed FCV-Attack, we demonstrate that SOTA LLMs (e.g., ChatGPT and Claude) and agent scaffolds (e.g., SWE-agent and OpenHands) are all vulnerable to this FCV threat; across 12 agent-model combinations on SWE-Bench, the attack requires only black-box access and a single query to the code agent. For example, for CWE-538 (information exposure vulnerability), the FCV-Attack attains an attack success rate of 40.7% on GPT-5 Mini + OpenHands. Our results reveal an important security threat overlooked by current evaluation paradigms and urge the development of security-aware defenses for code agents.
MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution
Jianwen Chen | Xinyu Yang | Peng Xia | Arian Azarang | Yueh Z Lee | Gang Li | Hongtu Zhu | Yun Li | Beidi Chen | Huaxiu Yao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have demonstrated strong performance and rapid progress in a wide range of medical reasoning tasks. However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems. To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri Net theory. The framework adopts a full-stack design across data, model architecture, and system execution. For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge-grounded medical reasoning paths and transforms them into Petri Net–structured representations. At the architectural level, we propose a topology-aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency. At the system level, we develop a customized inference engine that supports parallel execution without additional overhead. Empirical evaluations show that MedVerse improves strong general-purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance with improved clinical reliability, while delivering a 1.3× reduction in inference latency and a 1.7× increase in generation throughput, enabled by its parallel decoding capability.
GLARE: Agentic Reasoning for Legal Judgment Prediction
Xinyu Yang | Chenlong Deng | Zhicheng Dou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Legal judgment prediction serves as a pivotal task in intelligent judicial systems. Although large language models have achieved remarkable progress in general reasoning, they struggle with tasks that require fine-grained distinctions between similar charges. These models often select plausible charges directly without discriminating among closely related alternatives. In this paper, we introduce GLARE, an agentic legal reasoning framework that enables models to actively retrieve and apply external knowledge during decision-making. Unlike static prediction, GLARE simulates comparative reasoning by dynamically expanding the decision space to include confusing candidates, then retrieving exclusionary logic from precedents and statutes to identify the correct judgment. Experiments on real-world datasets show that our method significantly outperforms strong baselines, especially on complex cases involving confusing or rare charges. The code is available at https://anonymous.4open.science/r/GLARE-LJP-8EDF.
Polymorphic Universal Transformer
Yilong Chen | Zitian Gao | Yihao Xiao | Jason Klein Liu | Xinyu Yang | Yifan Luo | Haoming Luo | Zhengmao Ye | Tingwen Liu | Ran Tao | Bryan Dai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Although the Universal Transformer (UT) mitigates the diminishing returns of standard LLM scaling by decoupling parameter count from depth, it remains constrained by linear computational costs and rigid weight-sharing mechanisms. These limitations lead to severe functional homogeneity, which subsequently induces over-smoothing, representation rank collapse, and degraded reasoning performance. In this work, we present the first systematic study of Compute Distribution Skew, identifying it as the primary driver of extrapolation failure. This is a pathological phenomenon in ultra-deep recurrent Transformers characterized by a disproportionate distribution of contributions across recurrent steps, resulting in distinct functional states during prefix and suffix processing phases. To address this challenge, we propose the Polymorphic Transformer, which aims to achieve functional polymorphism and depth sparsity within a shared-parameter framework. By integrating conditional sparse subspaces, SiLU Attention, and an uncertainty-aware depth scheduler, our architecture mitigates power-method collapse and effectively decouples logical depth from computational cost. Experiments demonstrate that our model significantly enhances representation rank and robustness, achieving complex reasoning performance comparable to baseline while reducing computation by 64.7%.
2025
System Report for CCL25-Eval Task 4: Prompting, Scheduling, and Arbitration Strategies for Chinese Factivity Inference
Liu Daohuan | Xia Lun | Yuxuan Zhang | Xinyu Yang | Fanzhen Kong
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
This report presents the methodology and findings of prompting large language models (LLMs) for Chinese Factivity Inference (FI). We evaluated five LLMs, among which DeepSeek-R1 demonstrated the best overall performance. The final prompt combined Chain-of-Thought (CoT) reasoning, few-shot examples, and system-level instructions. Additionally, we introduced a pairwise task scheduling strategy and a multi-agent disagreement arbitration mechanism to further enhance inference quality. Experimental results show that the integration of prompting, scheduling, and arbitration strategies significantly improves performance, with DeepSeek-R1 achieving 91.7% overall accuracy on the evaluation set. The report also highlights findings regarding LLM behavior on FI tasks and outlines potential directions for future improvement.
Debiasing the Fine-Grained Classification Task in LLMs with Bias-Aware PEFT
Daiying Zhao | Xinyu Yang | Hang Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fine-grained classification via LLMs is susceptible to more complex label biases than traditional classification tasks. Existing bias mitigation strategies, such as retraining, post-hoc adjustment, and parameter-efficient fine-tuning (PEFT), are primarily effective for simple classification biases, such as stereotypes, but fail to adequately address prediction propensity and discriminative ability biases. In this paper, we analyze these two bias phenomena and observe their progressive accumulation from intermediate to deeper layers within LLMs. To mitigate this issue, we propose a bias-aware optimization framework that incorporates two distinct label balance constraints with a PEFT strategy targeting an intermediate layer. Our approach adjusts less than 1% of the model’s parameters while effectively curbing bias amplification in deeper layers. Extensive experiments conducted across 12 datasets and 5 LLMs demonstrate that our method consistently outperforms or matches the performance of full-parameter fine-tuning and LoRA, achieving superior results with lower perplexity.
Quantifying Semantic Emergence in Language Models
Hang Chen | Xinyu Yang | Jiaying Zhu | Wenya Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are widely recognized for their exceptional capacity to capture semantic meaning. Yet, there remains no established metric to quantify this capability. In this work, we introduce a quantitative metric, Information Emergence (IE), designed to measure LLMs’ ability to extract semantics from input tokens. We formalize “semantics” as the meaningful information abstracted from a sequence of tokens and quantify this by comparing the entropy reduction observed for a sequence of tokens (macro-level) and individual tokens (micro-level). To achieve this, we design a lightweight estimator to compute the mutual information at each transformer layer, which is agnostic to different tasks and language model architectures. We apply IE in both synthetic in-context learning (ICL) scenarios and natural sentence contexts. Experiments demonstrate informativeness and patterns about semantics. While some of these patterns confirm conventional linguistic knowledge, the rest are relatively unexpected and may provide new insights.
2023
How to Enhance Causal Discrimination of Utterances: A Case on Affective Reasoning
Hang Chen | Xinyu Yang | Jing Luo | Wenjing Zhu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Our investigation into the Affective Reasoning in Conversation (ARC) task highlights the challenge of causal discrimination. Almost all existing models, including large language models (LLMs), excel at capturing semantic correlations within utterance embeddings but fall short in determining the specific causal relationships. To overcome this limitation, we propose the incorporation of i.i.d. noise terms into the conversation process, thereby constructing a structural causal model (SCM). It explores how distinct causal relationships of fitted embeddings can be discerned through independent conditions. To facilitate the implementation of deep learning, we introduce the cogn frameworks to handle unstructured conversation data, and employ an autoencoder architecture to regard the unobservable noise as learnable “implicit causes.” Moreover, we curate a synthetic dataset that includes i.i.d. noise. Through comprehensive experiments, we validate the effectiveness and interpretability of our approach. Our code is available at https://github.com/Zodiark-ch/mater-of-our-EMNLP2023-paper.
Co-authors
- Hang Chen 3
- Beidi Chen 2
- Arian Azarang 1
- Jianwen Chen 1
- YiLong Chen 1
- Mihai Christodorescu 1
- Bryan Dai 1
- Liu Daohuan 1
- Chenlong Deng 1
- Zhicheng Dou (窦志成) 1
- Zitian Gao 1
- Fanzhen Kong 1
- Yueh Z Lee 1
- Gang Li 1
- Lei Li 1
- Yun Li 1
- Jason Klein Liu 1
- Tingwen Liu 1
- Xia Lun 1
- Haoming Luo 1
- Jing Luo 1
- Yifan Luo 1
- Ravi Mangal 1
- Corina S. Pasareanu 1
- Yibo Peng 1
- James Song 1
- Ran Tao 1
- Wenya Wang 1
- Peng Xia 1
- Yihao Xiao 1
- Huaxiu Yao 1
- Zhengmao Ye 1
- Yuxuan Zhang 1
- Daiying Zhao 1
- Haizhong Zheng 1
- Hongtu Zhu 1
- Jiaying Zhu 1
- Wenjing Zhu 1