Ian Harris

2025

pdf bib abs
Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense
Shagoto Rahman | Ian Harris
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

Large Language Models (LLMs) are widely used for their capabilities, but face threats from jailbreak attacks, which exploit LLMs to generate inappropriate information and bypass their defense system. Existing defenses are often specific to jailbreak attacks and as a result, a robust, attack-independent solution is needed to address both Natural Language Processing (NLP) ambiguities and attack variability. In this study, we have introduced, Summary The Savior, a novel jailbreak detection mechanism leveraging harmful keywords and query-based security-aware summary classification. By analyzing the illegal and improper contents of prompts within the summaries, the proposed method remains robust against attack diversity and NLP ambiguities. Two novel datasets for harmful keyword extraction and security aware summaries utilizing GPT-4 and Llama-3.1 70B respectively have been generated in this regard. Moreover, an “ambiguous harmful” class has been introduced to address content and intent ambiguities. Evaluation results demonstrate that, Summary The Savior achieves higher defense performance, outperforming state-of-the-art defense mechanisms namely Perplexity Filtering, SmoothLLM, Erase and Check with lowest attack success rates across various jailbreak attacks namely PAIR, GCG, JBC and Random Search, on Llama-2, Vicuna-13B and GPT-4. Our codes, models, and results are available at: https://github.com/shrestho10/SummaryTheSavior

2024

pdf bib abs
Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield
Jinhwa Kim | Ali Derakhshan | Ian Harris
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)

Large Language Models’ safety remains a critical concern due to their vulnerability to jailbreaking attacks, which can prompt these systems to produce harmful and malicious responses. Safety classifiers, computational models trained to discern and mitigate potentially harmful, offensive, or unethical outputs, offer a practical solution to address this issue. However, despite their potential, existing safety classifiers often fail when exposed to adversarial attacks such as gradient-optimized suffix attacks. In response, our study introduces Adversarial Prompt Shield (APS), a lightweight safety classifier model that excels in detection accuracy and demonstrates resilience against unseen jailbreaking prompts. We also introduce efficiently generated adversarial training datasets, named Bot Adversarial Noisy Dialogue (BAND), which are designed to fortify the classifier’s robustness. Through extensive testing on various safety tasks and unseen jailbreaking attacks, we demonstrate the effectiveness and resilience of our models. Evaluations show that our classifier has the potential to significantly reduce the Attack Success Rate by up to 44.9%. This advance paves the way for the next generation of more reliable and resilient Large Language Models.

2023

pdf bib abs
GAP-Gen: Guided Automatic Python Code Generation
Junchen Zhao | Yurun Song | Junlin Wang | Ian Harris
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Automatic code generation from natural language descriptions can be highly beneficial during the process of software development. In this work, we propose GAP-Gen, a Guided Automatic Python Code Generation method based on Python syntactic constraints and semantic constraints. We first introduce Python syntactic constraints in the form of Syntax-Flow, which is a simplified version of Abstract Syntax Tree (AST) reducing the size and high complexity of Abstract Syntax Tree but maintaining crucial syntactic information of Python code. In addition to Syntax-Flow, we introduce Variable-Flow which abstracts variable and function names consistently through out the code. In our work, rather than pretraining, we focus on modifying the finetuning process which reduces computational requirements but retains high generation performance on automatic Python code generation task. GAP-Gen fine-tunes the transformer based language models T5 and CodeT5 using the Code-to-Docstring datasets CodeSearchNet, CodeSearchNet AdvTest and Code-Docstring Corpus from EdinburghNLP. Our experiments show that GAP-Gen achieves better results on automatic Python code generation task than previous works

pdf bib abs
PCMID: Multi-Intent Detection through Supervised Prototypical Contrastive Learning
Yurun Song | Junchen Zhao | Spencer Koehler | Amir Abdullah | Ian Harris
Findings of the Association for Computational Linguistics: EMNLP 2023

Intent detection is a major task in Natural Language Understanding (NLU) and is the component of dialogue systems for interpreting users’ intentions based on their utterances. Many works have explored detecting intents by assuming that each utterance represents only a single intent. Such systems have achieved very good results; however, intent detection is a far more challenging task in typical real-world scenarios, where each user utterance can be highly complex and express multiple intents. Therefore, in this paper, we propose PCMID, a novel Multi-Intent Detection framework enabled by Prototypical Contrastive Learning under a supervised setting. The PCMID model can learn multiple semantic representations of a given user utterance under the context of different intent labels in an optimized semantic space. Our experiments show that PCMID achieves the current state-of-the-art performance on both multiple public benchmark datasets and a private real-world dataset for the multi-intent detection task.

2022

pdf bib abs
Mitra Behzadi at SemEval-2022 Task 5 : Multimedia Automatic Misogyny Identification method based on CLIP
Mitra Behzadi | Ali Derakhshan | Ian Harris
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Everyday more users are using memes on social media platforms to convey a message with text and image combined. Although there are many fun and harmless memes being created and posted, there are also ones that are hateful and offensive to particular groups of people. In this article present a novel approach based on the CLIP network to detect misogynous memes and find out the types of misogyny in that meme. We participated in Task A and Task B of the Multimedia Automatic Misogyny Identification (MaMi) challenge and our best scores are 0.694 and 0.681 respectively.

2021

pdf bib abs
Spoken Language Understanding for Task-oriented Dialogue Systems with Augmented Memory Networks
Jie Wu | Ian Harris | Hongzhi Zhao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Spoken language understanding, usually including intent detection and slot filling, is a core component to build a spoken dialog system. Recent research shows promising results by jointly learning of those two tasks based on the fact that slot filling and intent detection are sharing semantic knowledge. Furthermore, attention mechanism boosts joint learning to achieve state-of-the-art results. However, current joint learning models ignore the following important facts: 1. Long-term slot context is not traced effectively, which is crucial for future slot filling. 2. Slot tagging and intent detection could be mutually rewarding, but bi-directional interaction between slot filling and intent detection remains seldom explored. In this paper, we propose a novel approach to model long-term slot context and to fully utilize the semantic correlation between slots and intents. We adopt a key-value memory network to model slot context dynamically and to track more important slot tags decoded before, which are then fed into our decoder for slot tagging. Furthermore, gated memory information is utilized to perform intent detection, mutually improving both tasks through global optimization. Experiments on benchmark ATIS and Snips datasets show that our model achieves state-of-the-art performance and outperforms other methods, especially for the slot filling task.

Ian Harris

Fixing paper assignments

2025

2024

2023

2022

2021

2003

Co-authors

Venues