Wenchao Yu


2024

pdf
Large Language Models Can Be Contextual Privacy Protection Learners
Yijia Xiao | Yiqiao Jin | Yushi Bai | Yue Wu | Xianjun Yang | Xiao Luo | Wenchao Yu | Xujiang Zhao | Yanchi Liu | Quanquan Gu | Haifeng Chen | Wei Wang | Wei Cheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The proliferation of Large Language Models (LLMs) has driven considerable interest in fine-tuning them with domain-specific data to create specialized language models. Nevertheless, such domain-specific fine-tuning data often contains contextually sensitive personally identifiable information (PII). Direct fine-tuning LLMs on this data without privacy protection poses a risk of data leakage of sensitive PII during inference time. To address this challenge, we introduce Contextual Privacy Protection Language Models (CPPLM), a novel paradigm for fine-tuning LLMs that effectively injects domain-specific knowledge while safeguarding inference-time data privacy. Our work offers a theoretical analysis for model design and delves into various techniques such as corpus curation, penalty-based unlikelihood in training loss, and instruction-based tuning, etc. Extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches. In particular, instruction tuning with both positive and negative examples, stands out as a promising method, effectively protecting private data while enhancing the model’s knowledge. Our work underscores the potential for Large Language Models as robust contextual privacy protection learners.

pdf
InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration
Fali Wang | Runxue Bao | Suhang Wang | Wenchao Yu | Yanchi Liu | Wei Cheng | Haifeng Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have achieved exceptional capabilities in open generation across various domains, yet they encounter difficulties with tasks that require intensive knowledge. To address these challenges, methods for integrating knowledge have been developed, which augment LLMs with domain-specific knowledge graphs through external modules. These approaches, however, face data inefficiency issues as they necessitate the processing of both known and unknown knowledge for fine-tuning. Thus, our research focuses on a novel problem: efficiently integrating unknown knowledge into LLMs without unnecessary overlap of known knowledge. A risk of introducing new knowledge is the potential forgetting of existing knowledge. To mitigate this risk, we propose the innovative InfuserKI framework. This framework employs transformer internal states to determine when to enrich LLM outputs with additional information, effectively preventing knowledge forgetting. Performance evaluations using the UMLS-2.5k and MetaQA domain knowledge graphs reveal that InfuserKI not only successfully integrates new knowledge but also outperforms state-of-the-art baselines, reducing knowledge forgetting by 9% and 6%, respectively.