Jun Sun
Papers on this page may belong to the following people: Jun Sun, Jun Sun
2026
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models
Wei Zhao | Zhe Li | Yige Li | Jun Sun
Findings of the Association for Computational Linguistics: EACL 2026
Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including through the use of adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM’s behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from benign datasets in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets. As a result, we are able to completely eliminate GPT’s safety alignment in a black-box setting through fine-tuning with only benign data. Our code and data are available at anonymous.4open.science/r/suffix-maybe-features-D17C/.
2025
Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs
Wei Zhao | Zhe Li | Yige Li | Jun Sun
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Vision-Language Models (LVLMs) have made significant strides in multimodal comprehension, thanks to extensive pre-training and fine-tuning on large-scale visual datasets. However, despite their robust textual safety mechanisms, they remain vulnerable to harmful visual inputs. Existing safeguards, which typically rely on pre-filtering or fine-tuning, incur high costs and diminish overall utility. To address this critical vulnerability, we introduce SafeCLIP, a lightweight method that leverages LVLMs’ inherent multimodal alignment for zero-shot toxic image detection. By projecting CLIP’s discarded CLS token into its text space and matching it with toxic descriptors, SafeCLIP detects harmful content without any architectural changes, adding minimal latency and enabling dynamic safety corrections during inference and fine-tuning. Experiments show that SafeCLIP achieves a 66.9% defense success rate with only a 3.2% false positive rate and 7.2% overhead. In contrast, state-of-the-art methods achieve a 52.9% success rate but have a 10.7% false positive rate and 210% overhead. Our work demonstrates that leveraging inherent multimodal alignment can yield efficient, low-cost LVLM safety. Code is available at anonymous.4open.science/r/safeclip-2C01.
Do Influence Functions Work on Large Language Models?
Zhe Li | Wei Zhao | Yige Li | Jun Sun
Findings of the Association for Computational Linguistics: EMNLP 2025
Influence functions are important for quantifying the impact of individual training data points on a model’s predictions. Although extensive research has been conducted on influence functions in traditional machine learning models, their application to large language models (LLMs) has been limited. In this work, we conduct a systematic study to address a key question: do influence functions work on LLMs? Specifically, we evaluate influence functions across multiple tasks and find that they consistently perform poorly in most settings. Our further investigation reveals that their poor performance can be attributed to: (1) inevitable approximation errors when estimating the inverse Hessian-vector product (iHVP) component due to the scale of LLMs, (2) uncertain convergence during fine-tuning, and, more fundamentally, (3) the definition itself, as changes in model parameters do not necessarily correlate with changes in LLM behavior. Thus, our study suggests the need for alternative approaches for identifying influential samples.
2024
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
Wei Zhao | Zhe Li | Yige Li | Ye Zhang | Jun Sun
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from identified toxic layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) show that LED effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.
2016
Automatic Identifying Entity Type in Linked Data
Qingliang Miao | Ruiyu Fang | Shuangyong Song | Zhongguang Zheng | Lu Fang | Yao Meng | Jun Sun
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters
2015
Feature Reduction Using Ensemble Approach
Yingju Xia | Cuiqin Hou | Zhuoran Xu | Jun Sun
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters
2010
Discriminative Induction of Sub-Tree Alignment using Limited Labeled Data
Jun Sun | Min Zhang | Chew Lim Tan
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)
Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels
Jun Sun | Min Zhang | Chew Lim Tan
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
2009
A non-contiguous Tree Sequence Alignment-based Model for Statistical Machine Translation
Jun Sun | Min Zhang | Chew Lim Tan
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
2007
A tree-to-tree alignment-based model for statistical machine translation
Min Zhang | Hongfei Jiang | Ai Ti Aw | Jun Sun | Sheng Li | Chew Lim Tan
Proceedings of Machine Translation Summit XI: Papers
I2R Chinese-English translation system for IWSLT 2007
Boxing Chen | Jun Sun | Hongfei Jiang | Min Zhang | Ai Ti Aw
Proceedings of the Fourth International Workshop on Spoken Language Translation
In this paper, we describe the system and approach used by the Institute for Infocomm Research (I2R) for the IWSLT 2007 spoken language evaluation campaign. A multi-pass approach is exploited to generate and select the best translation. First, we use two decoders, namely the open-source Moses and an in-house syntax-based decoder, to generate N-best lists. Next, we spawn new translation entries through a word-based n-gram language model estimated on the former N-best entries. Finally, we join the N-best lists from the previous two passes and select the best translation by rescoring them with additional feature functions. In particular, this paper reports our effort on new translation entry generation and system combination. The performance on the development and test sets is reported. The system was ranked first with respect to the BLEU measure in the Chinese-to-English open data track.