Renzhi Wang

Also published as: 任之


2024

基于中间层对齐的异构师生模型知识蒸馏 (Knowledge Distillation for Heterogeneous Teacher-Student Models with an Intermediate-Layer Loss)
Feiyan Zhai (翟飞燕) | Renzhi Wang (王任之) | Piji Li (李丕绩)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“Knowledge distillation is a leading model compression strategy in the era of large language models: by effectively transferring the knowledge of a complex model into a simpler one, it substantially reduces parameter counts and computational cost. Nevertheless, mainstream distillation algorithms for generative large language models focus mainly on optimizing the loss at the final output layer of the teacher and student, leaving the models’ intermediate layers unexplored. Moreover, existing work on intermediate-layer distillation typically imposes strict structural-consistency requirements on the teacher and student models, and therefore cannot handle distillation between heterogeneous models, a clear limitation. To address these problems, we propose a new knowledge distillation algorithm: a distillation algorithm for heterogeneous generative teacher-student large language models that introduces an intermediate-layer distillation loss. The algorithm first extracts intermediate-layer representations of the teacher and student as distillation targets; it then applies purpose-designed layer-mapping rules and an alignment module to align knowledge and compute losses across the intermediate layers of heterogeneous models; finally, the weights of the individual distillation losses are jointly optimized. Experiments on five related datasets show that our method delivers significant gains in distillation quality.”
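A minimal PyTorch sketch of the idea described in the abstract, assuming a uniform layer-mapping rule, a linear alignment module, MSE as the intermediate-layer loss, and a temperature-scaled KL output loss; the names, mapping rule, and fixed weighting are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateAligner(nn.Module):
    """Projects student hidden states into the teacher's hidden size so
    heterogeneous models can be compared layer-to-layer (assumed design)."""
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden):
        return self.proj(student_hidden)

def map_layers(num_student_layers, num_teacher_layers):
    """Hypothetical uniform mapping rule: pair each student layer with a
    proportionally spaced teacher layer."""
    return [round(i * (num_teacher_layers - 1) / (num_student_layers - 1))
            for i in range(num_student_layers)]

def distillation_loss(student_hiddens, teacher_hiddens, aligners,
                      student_logits, teacher_logits, alpha=0.5, T=2.0):
    """Combine intermediate-layer MSE with the usual output-layer KL term;
    alpha balances the two losses (a fixed placeholder for the paper's
    jointly optimized weighting)."""
    mapping = map_layers(len(student_hiddens), len(teacher_hiddens))
    mid_loss = sum(
        F.mse_loss(aligners[i](student_hiddens[i]),
                   teacher_hiddens[j].detach())
        for i, j in enumerate(mapping)
    ) / len(mapping)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return alpha * mid_loss + (1 - alpha) * kd_loss
```

The alignment modules (one `IntermediateAligner` per mapped student layer, held in an `nn.ModuleList`) are what make the heterogeneous case workable: the teacher and student need not share hidden sizes or layer counts.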

LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models
Renzhi Wang | Piji Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) require continual knowledge updates to stay abreast of ever-changing world facts, prompting the formulation of the lifelong model editing task. While recent years have witnessed the development of various techniques for single and batch editing, these methods either fail to apply or perform sub-optimally when faced with lifelong editing. In this paper, we introduce LEMoE, an advanced Mixture of Experts (MoE) adaptor for lifelong model editing. We first analyze the factors influencing the effectiveness of a conventional MoE adaptor in lifelong editing, including catastrophic forgetting, inconsistent routing and order sensitivity. Based on these insights, we propose a tailored module insertion method to achieve lifelong editing, incorporating a novel KV anchor routing to enhance routing consistency between the training and inference stages, along with a concise yet effective clustering-based editing order planning. Experimental results demonstrate the effectiveness of our method in lifelong editing, surpassing previous model editing techniques while maintaining outstanding performance in the batch editing task. Our code will be available.
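The following is a hypothetical PyTorch sketch of anchor-based top-1 routing over a low-rank MoE adaptor, loosely inspired by the KV anchor routing described above; the class names, cosine scoring, and residual insertion are assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVAnchorRouter(nn.Module):
    """Anchor-based routing sketch: each expert owns a key vector, and a
    token is routed to the expert whose key is most similar to the token's
    hidden state. Fixed anchor keys (rather than a trained linear router)
    are one way to keep routing consistent between training and inference."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_experts, hidden_dim))

    def forward(self, hidden):                       # hidden: (batch, dim)
        scores = F.cosine_similarity(
            hidden.unsqueeze(1), self.keys.unsqueeze(0), dim=-1
        )                                            # (batch, num_experts)
        return scores.argmax(dim=-1)                 # top-1 expert index

class MoEAdaptor(nn.Module):
    """Each expert is a small bottleneck MLP inserted beside a frozen
    layer; only the expert selected by the router fires for each input."""
    def __init__(self, hidden_dim, num_experts, rank=8):
        super().__init__()
        self.router = KVAnchorRouter(hidden_dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, rank),
                          nn.ReLU(),
                          nn.Linear(rank, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, hidden):
        idx = self.router(hidden)
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(hidden[mask])
        return hidden + out  # residual insertion beside the frozen layer
```

Clustering-based order planning could then, for instance, group edit embeddings (e.g., with k-means) and apply edits cluster by cluster so that similar edits land on the same expert; that detail is likewise an assumption here.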

Semantic are Beacons: A Semantic Perspective for Unveiling Parameter-Efficient Fine-Tuning in Knowledge Learning
Renzhi Wang | Piji Li
Findings of the Association for Computational Linguistics: ACL 2024

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of Large Language Models (LLMs) to various downstream applications. However, the effectiveness of PEFT diminishes notably when downstream tasks require accurate learning of specific knowledge. In this paper, we adopt a semantic perspective to investigate this phenomenon, uncovering the reasons behind PEFT’s limitations in knowledge learning tasks. Our findings reveal that: (1) PEFT presents a notable risk of pushing the model away from the intended knowledge target; (2) multiple pieces of knowledge interfere with one another, and such interference suppresses the learning and expression of knowledge features. Based on these insights, we introduce a data filtering strategy to exclude data that is detrimental to knowledge learning and a re-weighted learning strategy to make the model attentive to semantic distance during knowledge learning. Experimental results demonstrate the effectiveness of the proposed method on open-source large language models, further validating the semantic challenge in PEFT and paving the way for future research.
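Below is a speculative PyTorch sketch of the two strategies, assuming sentence embeddings are available for measuring semantic distance; the cosine-distance threshold and the `1 + distance` weighting are illustrative placeholders, not the paper's actual formulas:

```python
import torch
import torch.nn.functional as F

def semantic_distance(emb_a, emb_b):
    """Cosine distance between sentence embeddings (e.g., mean-pooled
    hidden states); smaller means semantically closer."""
    return 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)

def filter_batch(examples, embeddings, target_emb, max_dist=0.6):
    """Data filtering (hypothetical threshold): drop samples whose
    embedding drifts too far from the intended knowledge target."""
    keep = semantic_distance(embeddings, target_emb) <= max_dist
    return [ex for ex, k in zip(examples, keep) if k]

def reweighted_lm_loss(logits, labels, distances):
    """Re-weighted learning: scale each sample's loss by its semantic
    distance so training attends to closing the gap to the target.
    logits: (batch, seq_len, vocab); labels: (batch, seq_len);
    distances: (batch,) semantic distance per sample."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none"
    )                                    # (batch, seq_len)
    per_sample = per_token.mean(dim=1)   # (batch,)
    weights = 1.0 + distances            # assumed weighting scheme
    return (weights * per_sample).mean()
```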

2023

InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation
Renzhi Wang | Jing Li | Piji Li
Findings of the Association for Computational Linguistics: EMNLP 2023

Diffusion models have garnered considerable interest in the field of text generation. Several studies have explored text diffusion models with different structures and applied them to various tasks, including named entity recognition and summarization. However, there exists a notable disparity between the “easy-first” text generation process of current diffusion models and the “keyword-first” natural text generation process of humans, which has received limited attention. To bridge this gap, we propose InfoDiffusion, a non-autoregressive text diffusion model. Our approach introduces a “keyinfo-first” generation strategy and incorporates a noise schedule based on the amount of text information. In addition, InfoDiffusion combines self-conditioning with a newly proposed partially noising model structure. Experimental results show that InfoDiffusion outperforms the baseline model in generation quality and diversity while exhibiting higher sampling efficiency.
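As a rough illustration of an information-aware schedule, the sketch below masks low-information tokens at early forward steps so that high-information “key info” tokens are the first to be recovered during reverse generation; the discrete masking view and the frequency-based self-information estimate are simplifying assumptions, not the paper's exact continuous noise schedule:

```python
import math
import torch

def token_information(token_counts):
    """Self-information -log p(w) per vocabulary token, estimated from
    corpus counts (token_counts: token id -> count, assumed available)."""
    total = sum(token_counts.values())
    return {w: -math.log(c / total) for w, c in token_counts.items()}

def entropy_aware_mask_schedule(token_ids, info, t, T):
    """At forward diffusion step t of T, noise (mask) low-information
    tokens first and preserve high-information tokens longest, so the
    reverse process generates key information first.
    token_ids: (seq_len,) tensor; info: token id -> self-information."""
    scores = torch.tensor([info[int(i)] for i in token_ids])
    order = scores.argsort()             # ascending information content
    num_noised = int(len(token_ids) * t / T)
    mask = torch.zeros_like(token_ids, dtype=torch.bool)
    mask[order[:num_noised]] = True      # True = replaced by noise/[MASK]
    return mask
```

Because the tokens masked last in the forward process are recovered first in the reverse process, ordering the masking by self-information is one simple way to realize the “keyinfo-first” behavior described in the abstract.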