Tao Wei
2026
Revisiting the Reliability of Language Models in Instruction-Following
Jianshuo Dong | Yutong Zhang | Liu Yan | Zhenyu Zhong | Tao Wei | Chao Zhang | Han Qiu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jianshuo Dong | Yutong Zhang | Liu Yan | Zhenyu Zhong | Tao Wei | Chao Zhang | Han Qiu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability—their performance can drop by up to 61.8% with nuanced prompt modifications. What’s more, we characterize it and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are accessible: https://github.com/jianshuod/IFEval-pp.
CoTrust: Privacy-Preserving Collaboration Between Large and Small Language Models in Trusted Execution Environments
Zhenya Ma | Tingyi Wang | Yongheng Deng | Ziqing Qiao | Yinggui Wang | Tao Wei | Lei Wang | Ju Ren
Findings of the Association for Computational Linguistics: ACL 2026
Zhenya Ma | Tingyi Wang | Yongheng Deng | Ziqing Qiao | Yinggui Wang | Tao Wei | Lei Wang | Ju Ren
Findings of the Association for Computational Linguistics: ACL 2026
Services powered by large language models (LLMs) provide powerful text generation capabilities, but accessing sensitive user inputs raises serious privacy concerns. Trusted Execution Environments (TEEs) provide a secure computation environment, enabling sensitive inputs to be safely processed. However, directly deploying high-capacity LLMs in TEEs is often prohibitively expensive due to computation and memory constraints. To reconcile privacy, efficiency, and generation quality, we propose CoTrust, a privacy-preserving collaborative inference framework that combines LLMs with small language models (SLMs) inside TEE. CoTrust uses multiple de-identified views to let the LLM produce a consensus scaffold capturing answer reasoning without exposing private information, which the SLM then grounds in the full input to generate the final response. Experiments on multiple question answering and summarization benchmarks show that CoTrust approaches the performance of unconstrained LLMs, outperforms existing privacy-preserving baselines, and maintains strong privacy protection, while remaining efficient in a TDX-based TEE implementation.
2025
“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents
Jianshuo Dong | Yutong Zhang | Liu Yan | Zhenyu Zhong | Tao Wei | Ke Xu | Minlie Huang | Chao Zhang | Han Qiu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Jianshuo Dong | Yutong Zhang | Liu Yan | Zhenyu Zhong | Tao Wei | Ke Xu | Minlie Huang | Chao Zhang | Han Qiu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) exhibit prompt leakage vulnerabilities, where they may be coaxed into revealing system prompts embedded in LLM services, raising intellectual property and confidentiality concerns. An intriguing question arises: Do LLMs genuinely internalize prompt leakage intents in their hidden states before generating tokens? In this work, we use probing techniques to capture LLMs’ intent-related internal representations and confirm that the answer is yes. We start by comprehensively inducing prompt leakage behaviors across diverse system prompts, attack queries, and decoding methods. We develop a hybrid labeling pipeline, enabling the identification of broader prompt leakage behaviors beyond mere verbatim leaks. Our results show that a simple linear probe can predict prompt leakage risks from pre-generation hidden states without generating any tokens. Across all tested models, linear probes consistently achieve 90%+ AUROC, even when applied to new system prompts and attacks. Understanding the model internals behind prompt leakage drives practical applications, including intention-based detection of prompt leakage risks. Code is available at: https://github.com/jianshuod/Probing-leak-intents.