Yuping Lin
2025
Towards Knowledge Checking in Retrieval-augmented Generation: A Representation Perspective
Shenglai Zeng | Jiankun Zhang | Bingheng Li | Yuping Lin | Tianqi Zheng | Dante Everaert | Hanqing Lu | Hui Liu | Hui Liu | Yue Xing | Monica Xiao Cheng | Jiliang Tang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) systems have shown promise in enhancing the performance of Large Language Models (LLMs). However, these systems face challenges in effectively integrating external knowledge with the LLM’s internal knowledge, often leading to issues with misleading or unhelpful information. This work aims to provide a systematic study on knowledge checking in RAG systems. We conduct a comprehensive analysis of LLM representation behaviors and demonstrate the significance of using representations in knowledge checking. Motivated by the findings, we further develop representation-based classifiers for knowledge filtering. We show substantial improvements in RAG performance, even when dealing with noisy knowledge databases. Our study provides new insights into leveraging LLM representations for enhancing the reliability and effectiveness of RAG systems.
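The abstract above describes training classifiers on LLM representations to filter retrieved knowledge before it enters the RAG prompt. The sketch below is one plausible instantiation of that idea rather than the paper's exact pipeline: the model name, hidden-layer choice, last-token pooling, prompt template, and linear probe are all assumptions made for illustration.

```python
# Minimal sketch of a representation-based knowledge checker for RAG, assuming
# labeled (query, passage) pairs marked helpful (1) vs. misleading/noisy (0).
# Model name, layer, and pooling are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: any decoder-only LLM could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def passage_representation(query: str, passage: str, layer: int = -1) -> torch.Tensor:
    """Last-token hidden state of a chosen layer for a (query, passage) pair."""
    prompt = f"Question: {query}\nRetrieved knowledge: {passage}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of [1, seq_len, hidden_dim] tensors, one per layer
    return outputs.hidden_states[layer][0, -1, :]

def train_knowledge_filter(pairs, labels):
    """Fit a linear probe on representations to separate helpful from misleading passages."""
    feats = torch.stack([passage_representation(q, p) for q, p in pairs]).float().numpy()
    return LogisticRegression(max_iter=1000).fit(feats, labels)

# At inference time, passages the probe scores as misleading can be dropped
# before they are concatenated into the RAG prompt.
```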
2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin | Pengfei He | Han Xu | Yue Xing | Makoto Yamada | Hui Liu | Jiliang Tang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. Although there are diverse jailbreak attack strategies, there is no unified understanding of why some methods succeed while others fail. This paper explores the behavior of harmful and harmless prompts in the LLM’s representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share a similar property: they are effective in moving the representation of the harmful prompt towards the direction of the harmless prompts. We incorporate hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate this hypothesis using the proposed objective. We hope this study provides new insights into how LLMs process harmfulness information.
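The abstract frames jailbreaks as movement of a harmful prompt's representation toward the harmless ("acceptance") direction. The sketch below shows one plausible way to estimate such a direction and measure that movement; the model choice, layer, and mean-difference construction are assumptions for illustration and are not the paper's exact objective.

```python
# Minimal sketch of the representation-space view described above, assuming small sets
# of harmful and harmless prompts. The "acceptance direction" is taken here as the
# difference of mean last-token hidden states; layer and aggregation are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str, layer: int = -1) -> torch.Tensor:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1, :]

def acceptance_direction(harmless_prompts, harmful_prompts, layer=-1):
    """Unit vector pointing from the harmful-prompt cluster toward the harmless cluster."""
    h_benign = torch.stack([last_token_state(p, layer) for p in harmless_prompts]).mean(0)
    h_harmful = torch.stack([last_token_state(p, layer) for p in harmful_prompts]).mean(0)
    d = h_benign - h_harmful
    return d / d.norm()

def shift_score(attack_prompt: str, base_prompt: str, direction: torch.Tensor, layer=-1):
    """How far a jailbreak prompt moves the base harmful prompt along the direction."""
    delta = last_token_state(attack_prompt, layer) - last_token_state(base_prompt, layer)
    return torch.dot(delta, direction).item()
```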